Partially Observable Markov Decision Processes
[1] Leslie Pack Kaelbling, et al. Learning Policies for Partially Observable Environments: Scaling Up, 1997, ICML.
[2] John N. Tsitsiklis, et al. The Complexity of Markov Decision Processes, 1987, Math. Oper. Res.
[3] Panos E. Trahanias, et al. Real-time hierarchical POMDPs for autonomous robot navigation, 2007, Robotics Auton. Syst.
[4] Joel Veness, et al. Monte-Carlo Planning in Large POMDPs, 2010, NIPS.
[5] G. Monahan. State of the Art—A Survey of Partially Observable Markov Decision Processes: Theory, Models, and Algorithms, 1982.
[6] Jesse Hoey, et al. Solving POMDPs with Continuous or Large Discrete Observation Spaces, 2005, IJCAI.
[7] Craig Boutilier, et al. Stochastic Local Search for POMDP Controllers, 2004, AAAI.
[8] Neil Immerman, et al. The Complexity of Decentralized Control of Markov Decision Processes, 2000, UAI.
[9] Michael L. Littman, et al. Memoryless policies: theoretical limitations and practical results, 1994.
[10] Shlomo Zilberstein, et al. Finite-memory control of partially observable systems, 1998.
[11] Nikos A. Vlassis, et al. Robot Planning in Partially Observable Continuous Domains, 2005, BNAIC.
[12] Alex Pentland, et al. Active gesture recognition using partially observable Markov decision processes, 1996, ICPR.
[13] Craig Boutilier, et al. Symbolic Dynamic Programming for First-Order MDPs, 2001, IJCAI.
[14] Steven L. Shafer, et al. Comparison of Some Suboptimal Control Policies in Medical Drug Therapy, 1996, Oper. Res.
[15] Joelle Pineau, et al. An integrated approach to hierarchy and abstraction for POMDPs, 2002.
[16] W. Burgard, et al. Markov Localization for Mobile Robots in Dynamic Environments, 1999, J. Artif. Intell. Res.
[17] Joelle Pineau, et al. Active Learning in Partially Observable Markov Decision Processes, 2005, ECML.
[18] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.
[19] Peter Vrancx, et al. Reinforcement Learning: State-of-the-Art, 2012.
[20] Yossi Aviv, et al. A Partially Observed Markov Decision Process for Dynamic Pricing, 2005, Manag. Sci.
[21] Chelsea C. White, et al. A survey of solution techniques for the partially observed Markov decision process, 1991, Ann. Oper. Res.
[22] Nicholas Roy, et al. The permutable POMDP: fast solutions to POMDPs for preference elicitation, 2008, AAMAS.
[23] Zhengzhu Feng, et al. Dynamic Programming for POMDPs Using a Factored State Representation, 2000, AIPS.
[24] Leslie Pack Kaelbling, et al. Planning and Acting in Partially Observable Stochastic Domains, 1998, Artif. Intell.
[25] Blai Bonet, et al. An ε-Optimal Grid-Based Algorithm for Partially Observable Markov Decision Processes, 2002, ICML.
[26] Stuart J. Russell, et al. Approximating Optimal Policies for Partially Observable Stochastic Domains, 1995, IJCAI.
[27] Pascal Poupart, et al. Point-Based Value Iteration for Continuous POMDPs, 2006, J. Mach. Learn. Res.
[28] Anne Condon, et al. On the undecidability of probabilistic planning and related stochastic optimization problems, 2003, Artif. Intell.
[29] Shlomo Zilberstein, et al. Formal models and algorithms for decentralized decision making under uncertainty, 2008, Autonomous Agents and Multi-Agent Systems.
[30] Andrew W. Moore, et al. Gradient Descent for General Reinforcement Learning, 1998, NIPS.
[31] Nicholas Roy, et al. Exponential Family PCA for Belief Compression in POMDPs, 2002, NIPS.
[32] Craig Boutilier, et al. Value-Directed Compression of POMDPs, 2002, NIPS.
[33] Milos Hauskrecht, et al. Value-Function Approximations for Partially Observable Markov Decision Processes, 2000, J. Artif. Intell. Res.
[34] E. Dynkin. Controlled Random Sequences, 1965.
[35] Hsien-Te Cheng, et al. Algorithms for partially observable Markov decision processes, 1989.
[36] Sebastian Thrun, et al. Probabilistic robotics, 2002, CACM.
[37] Andrew McCallum, et al. Reinforcement learning with selective perception and hidden state, 1996.
[38] Sebastian Thrun, et al. Monte Carlo POMDPs, 1999, NIPS.
[39] Eric A. Hansen, et al. An Improved Grid-Based Approximation Algorithm for POMDPs, 2001, IJCAI.
[40] Reid G. Simmons, et al. Heuristic Search Value Iteration for POMDPs, 2004, UAI.
[41] Steve J. Young, et al. Partially observable Markov decision processes for spoken dialog systems, 2007, Comput. Speech Lang.
[42] Milos Hauskrecht, et al. Planning treatment of ischemic heart disease with partially observable Markov decision processes, 2000, Artif. Intell. Medicine.
[43] Ross B. Corotis, et al. Inspection, Maintenance, and Repair with Partial Observability, 1995.
[44] Marco Wiering, et al. Utile distinction hidden Markov models, 2004, ICML.
[45] Joelle Pineau, et al. Spoken Dialog Management for Robots, 2000, ACL.
[46] A. Yezzi, et al. Local or Global Minima: Flexible Dual-Front Active Contours, 2007.
[47] Joelle Pineau, et al. Online Planning Algorithms for POMDPs, 2008, J. Artif. Intell. Res.
[48] S. Nanda. Mathematical Analysis and Applications, 2004.
[49] Nikos A. Vlassis, et al. Perseus: Randomized Point-based Value Iteration for POMDPs, 2005, J. Artif. Intell. Res.
[50] Richard S. Sutton, et al. Predictive Representations of State, 2001, NIPS.
[51] Michael R. James, et al. Predictive State Representations: A New Theory for Modeling Dynamical Systems, 2004, UAI.
[52] Edward J. Sondik. The optimal control of partially observable Markov processes, 1971.
[53] Douglas Aberdeen, et al. Scalable Internal-State Policy-Gradient Methods for POMDPs, 2002, ICML.
[54] Peter L. Bartlett, et al. Infinite-Horizon Policy-Gradient Estimation, 2001, J. Artif. Intell. Res.
[55] E. J. Sondik. The Optimal Control of Partially Observable Markov Processes, 1971.
[56] Karl Johan Åström. Optimal control of Markov processes with incomplete state information, 1965.
[57] Jan Peters. Policy gradient methods, 2010, Scholarpedia.
[58] Marc Toussaint, et al. Model-free reinforcement learning as mixture learning, 2009, ICML.
[59] Michael I. Jordan, et al. Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems, 1994, NIPS.
[60] Scott Sanner, et al. Symbolic Dynamic Programming for First-order POMDPs, 2010, AAAI.
[61] Craig Boutilier, et al. Bounded Finite State Controllers, 2003, NIPS.
[62] Alvin W. Drake. Observation of a Markov process through a noisy channel, 1962.
[63] Joelle Pineau, et al. Towards robotic assistants in nursing homes: Challenges and results, 2003, Robotics Auton. Syst.
[64] Stefan Schaal, et al. Natural Actor-Critic, 2003, Neurocomputing.
[65] Jeff G. Schneider, et al. Policy Search by Dynamic Programming, 2003, NIPS.
[66] Andrew G. Barto, et al. Optimal learning: computational procedures for Bayes-adaptive Markov decision processes, 2002.
[67] John Loch, et al. Using Eligibility Traces to Find the Best Memoryless Policy in Partially Observable Markov Decision Processes, 1998, ICML.
[68] Nikos A. Vlassis, et al. Optimal and Approximate Q-value Functions for Decentralized POMDPs, 2008, J. Artif. Intell. Res.
[69] Peter L. Bartlett, et al. Experiments with Infinite-Horizon, Policy-Gradient Estimation, 2001, J. Artif. Intell. Res.
[70] Nikos A. Vlassis, et al. A point-based POMDP algorithm for robot planning, 2004, ICRA.
[71] Michael I. Jordan, et al. Learning Without State-Estimation in Partially Observable Markovian Decision Processes, 1994, ICML.
[72] C. R. Sox, et al. Adaptive Inventory Control for Nonstationary Demand and Partial Information, 2002, Manag. Sci.
[73] Jesse Hoey, et al. A Decision-Theoretic Approach to Task Assistance for Persons with Dementia, 2005, IJCAI.
[74] Anthony R. Cassandra, et al. Development and Evaluation of a Bayesian Low-Vision Navigation Aid, 2007, IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans.
[75] William S. Lovejoy. Computationally Feasible Bounds for Partially Observed Markov Decision Processes, 1991, Oper. Res.
[76] Pedro U. Lima, et al. Active cooperative perception in network robot systems using POMDPs, 2010, IROS.
[77] Wenju Liu, et al. Planning in Stochastic Domains: Problem Characteristics and Approximation, 1996.
[78] Chelsea C. White, et al. A Hybrid Genetic/Optimization Algorithm for Finite-Horizon, Partially Observed Markov Decision Processes, 2004, INFORMS J. Comput.
[79] Eric A. Hansen. Solving POMDPs by Searching in Policy Space, 1998, UAI.
[80] Roni Khardon, et al. Relational Partially Observable MDPs, 2010, AAAI.
[81] Andrew McCallum. Instance-Based Utile Distinctions for Reinforcement Learning, 1995.
[82] Michael L. Littman, et al. Incremental Pruning: A Simple, Fast, Exact Method for Partially Observable Markov Decision Processes, 1997, UAI.
[83] Jonathan Baxter, et al. Scaling Internal-State Policy-Gradient Methods for POMDPs, 2002.
[84] Deb Roy, et al. Connecting language to the world, 2005, Artif. Intell.
[85] Joelle Pineau, et al. Point-based value iteration: An anytime algorithm for POMDPs, 2003, IJCAI.
[86] Yishay Mansour, et al. Approximate Planning in Large POMDPs via Reusable Trajectories, 1999, NIPS.
[87] Guy Shani, et al. Resolving Perceptual Aliasing in the Presence of Noisy Sensors, 2004, NIPS.
[88] Robert G. Haight, et al. Optimal control of an invasive species with imperfect information about the level of infestation, 2010.
[89] David Hsu, et al. SARSOP: Efficient Point-Based POMDP Planning by Approximating Optimally Reachable Belief Spaces, 2008, Robotics: Science and Systems.
[90] Joelle Pineau, et al. Bayes-Adaptive POMDPs, 2007, NIPS.
[91] Milind Tambe, et al. Exploiting belief bounds: practical POMDPs for personal assistant agents, 2005, AAMAS.
[92] P. Poupart. Exploiting structure to efficiently solve large scale partially observable Markov decision processes, 2005.
[93] Jesse Hoey, et al. Value-Directed Human Behavior Analysis from Video Using Partially Observable Markov Decision Processes, 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[94] Geoffrey J. Gordon, et al. Finding Approximate POMDP Solutions Through Belief Compression, 2005, J. Artif. Intell. Res.
[95] George E. Monahan. A Survey of Partially Observable Markov Decision Processes: Theory, Models, and Algorithms, 1982, Manag. Sci.
[96] Michael I. Jordan, et al. PEGASUS: A policy search method for large MDPs and POMDPs, 2000, UAI.
[97] Kee-Eung Kim, et al. Learning Finite-State Controllers for Partially Observable Environments, 1999, UAI.
[98] Long Lin, et al. Memory Approaches to Reinforcement Learning in Non-Markovian Domains, 1992.
[99] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.
[100] Benjamin Van Roy, et al. A Tractable POMDP for a Class of Sequencing Problems, 2001, UAI.
[101] Reid G. Simmons, et al. Point-Based POMDP Algorithms: Improved Analysis and Implementation, 2005, UAI.
[102] Nikos A. Vlassis, et al. Planning with Continuous Actions in Partially Observable Environments, 2005, ICRA.
[103] Bram Bakker. Reinforcement Learning with Long Short-Term Memory, 2001, NIPS.
[104] Michael L. Littman, et al. Algorithms for Sequential Decision Making, 1996.
[105] Andrew McCallum, et al. Overcoming Incomplete Perception with Utile Distinction Memory, 1993, ICML.
[106] Jürgen Schmidhuber, et al. HQ-Learning, 1997, Adapt. Behav.
[107] Satinder P. Singh, et al. Experimental Results on Learning Stochastic Memoryless Policies for Partially Observable Markov Decision Processes, 1998, NIPS.
[108] Kee-Eung Kim, et al. Solving POMDPs by Searching the Space of Finite Policies, 1999, UAI.
[109] Leslie Pack Kaelbling, et al. Acting Optimally in Partially Observable Stochastic Domains, 1994, AAAI.
[110] J. Satia, et al. Markovian Decision Processes with Probabilistic Observation of States, 1973.
[111] Guy Shani, et al. Forward Search Value Iteration for POMDPs, 2007, IJCAI.
[112] Craig Boutilier, et al. Computing Optimal Policies for Partially Observable Decision Processes Using Compact Representations, 1996, AAAI/IAAI.
[113] Leslie Pack Kaelbling, et al. Acting under uncertainty: discrete Bayesian models for mobile-robot navigation, 1996, IROS.
[114] Ronen I. Brafman. A Heuristic Variable Grid Solution Method for POMDPs, 1997, AAAI/IAAI.
[115] Leslie Pack Kaelbling, et al. Continuous-State POMDPs with Hybrid Dynamics, 2008, ISAIM.
[116] Edward J. Sondik, et al. The Optimal Control of Partially Observable Markov Processes over a Finite Horizon, 1973, Oper. Res.
[117] Richard Dearden, et al. Planning to see: A hierarchical approach to planning visual actions on a robot using POMDPs, 2010, Artif. Intell.
[118] Reid G. Simmons, et al. Unsupervised learning of probabilistic models for robot navigation, 1996, ICRA.
[119] Pascal Poupart, et al. Model-based Bayesian Reinforcement Learning in Partially Observable Domains, 2008, ISAIM.
[120] Jesse Hoey, et al. An analytic solution to discrete Bayesian reinforcement learning, 2006, ICML.
[121] Edward J. Sondik. The Optimal Control of Partially Observable Markov Processes over the Infinite Horizon, 1978.
[122] Guy Shani, et al. Model-Based Online Learning of POMDPs, 2005, ECML.
[123] R. L. Stratonovich. Conditional Markov Processes, 1960.
[124] Shlomo Zilberstein, et al. Region-Based Incremental Pruning for POMDPs, 2004, UAI.
[125] Reid G. Simmons, et al. Probabilistic Robot Navigation in Partially Observable Environments, 1995, IJCAI.
[126] Kin Man Poon. A fast heuristic algorithm for decision-theoretic planning, 2001.
[127] A. Cassandra. Exact and approximate algorithms for partially observable Markov decision processes, 1998.
[128] Sebastian Thrun, et al. Coastal Navigation with Mobile Robots, 1999, NIPS.
[129] Sridhar Mahadevan, et al. Approximate planning with hierarchical partially observable Markov decision process models for robot navigation, 2002, ICRA.
[130] Guy Shani, et al. Efficient ADD Operations for Point-Based Algorithms, 2008, ICAPS.
[131] Leslie Pack Kaelbling, et al. Grasping POMDPs, 2007, ICRA.