Learning and Solving Partially Observable Markov Decision Processes

Partially Observable Markov Decision Processes (POMDPs) provide a rich representation for agents acting in a stochastic domain under partial observability. An optimal POMDP policy balances competing objectives such as the need to gather information and the sum of collected rewards. However, POMDPs are difficult to use in practice for two reasons: first, the environment dynamics are hard to obtain, and second, even when the dynamics are given, solving a POMDP optimally is intractable. This dissertation addresses both difficulties. We begin with a number of methods for learning POMDPs, which are usually categorized as either model-free or model-based. We show how model-free methods fail to provide good policies as noise in the environment increases, and we then show how to transform model-free methods into model-based ones, thereby improving their solutions. This transformation is first demonstrated as an offline process, applied after the model-free method has computed a policy, and then in an online setting, where a model of the environment is learned together with a policy through interaction with the environment.

The second part of the dissertation focuses on solving predefined POMDPs. Point-based methods for computing value functions have shown great potential for solving large-scale POMDPs. We provide a number of new algorithms that outperform existing point-based methods. We first show how properly ordering the value function updates can greatly reduce the number of updates required. We then present a trial-based algorithm that outperforms all current point-based algorithms. The success of point-based algorithms on large domains creates a need for compact representations of the environment. We therefore thoroughly investigate the use of Algebraic Decision Diagrams (ADDs) for representing system dynamics, and show how all operations required by point-based algorithms can be implemented efficiently using ADDs.
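As a concrete illustration of the point-based value updates discussed above, the sketch below performs a single point-based Bellman backup at one belief point; this is the operation that point-based solvers repeat, and whose ordering the prioritization results above aim to improve. It is a minimal sketch under assumed conventions: the function name `point_based_backup` and the array layouts for `T`, `O`, and `R` are illustrative assumptions, not taken from the dissertation.

```python
import numpy as np

def point_based_backup(b, alpha_set, T, O, R, gamma):
    """Compute one point-based Bellman backup at belief b.

    Assumed (hypothetical) model layout:
      b         -- belief over states, shape (S,)
      alpha_set -- current value function as a list of alpha vectors, each shape (S,)
      T[a]      -- transition matrix with T[a][s, s2] = P(s2 | s, a)
      O[a]      -- observation matrix with O[a][s2, o] = P(o | s2, a)
      R[a]      -- immediate reward vector, shape (S,)
      gamma     -- discount factor in (0, 1)
    Returns the new alpha vector for b and the greedy action it corresponds to.
    """
    num_actions = len(T)
    num_obs = O[0].shape[1]
    best_vec, best_val, best_action = None, -np.inf, None
    for a in range(num_actions):
        vec = R[a].astype(float)
        for o in range(num_obs):
            # g_{a,o}^alpha(s) = sum_{s2} T(s2|s,a) O(o|s2,a) alpha(s2)
            candidates = [T[a] @ (O[a][:, o] * alpha) for alpha in alpha_set]
            # keep the alpha vector that is best for this belief under (a, o)
            vec = vec + gamma * max(candidates, key=lambda g: float(b @ g))
        val = float(b @ vec)
        if val > best_val:
            best_vec, best_val, best_action = vec, val, a
    return best_vec, best_action
```

Point-based algorithms such as PBVI, Perseus, and HSVI differ mainly in which beliefs they collect and in the order in which this backup is applied to them, which is where the update-ordering and trial-based results above come in.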
