Learning and Solving Partially Observable Markov Decision Processes

Partially Observable Markov Decision Processes (POMDPs) provide a rich representation for agents acting in a stochastic domain under partial observability. An optimal POMDP policy balances competing objectives such as the need to gather information and the sum of collected rewards. However, POMDPs are difficult to use in practice for two reasons: first, the environment dynamics are hard to obtain, and second, even when the dynamics are given, solving a POMDP optimally is intractable. This dissertation addresses both difficulties. We begin with a number of methods for learning POMDPs, which are usually categorized as either model-free or model-based. We show how model-free methods fail to provide good policies as noise in the environment increases, and we then show how to transform model-free methods into model-based ones, thereby improving their solutions. This transformation is first demonstrated as an offline process, applied after the model-free method has computed a policy, and then in an online setting, where a model of the environment is learned together with a policy through interaction with the environment.

The second part of the dissertation focuses on solving predefined POMDPs. Point-based methods for computing value functions have shown great potential for solving large-scale POMDPs. We provide a number of new algorithms that outperform existing point-based methods. We first show how properly ordering the value function updates can greatly reduce the number of updates required. We then present a trial-based algorithm that outperforms all current point-based algorithms. The success of point-based algorithms on large domains creates a need for compact representations of the environment. We therefore thoroughly investigate the use of Algebraic Decision Diagrams (ADDs) for representing system dynamics, and show how all operations required by point-based algorithms can be implemented efficiently using ADDs.
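As a concrete illustration of the point-based value updates discussed above, the sketch below performs a single point-based Bellman backup at one belief point; this is the operation that point-based solvers repeat, and whose ordering the prioritization results above aim to improve. It is a minimal sketch under assumed conventions: the function name `point_based_backup` and the array layouts for `T`, `O`, and `R` are illustrative assumptions, not taken from the dissertation.

```python
import numpy as np

def point_based_backup(b, alpha_set, T, O, R, gamma):
    """Compute one point-based Bellman backup at belief b.

    Assumed (hypothetical) model layout:
      b         -- belief over states, shape (S,)
      alpha_set -- current value function as a list of alpha vectors, each shape (S,)
      T[a]      -- transition matrix with T[a][s, s2] = P(s2 | s, a)
      O[a]      -- observation matrix with O[a][s2, o] = P(o | s2, a)
      R[a]      -- immediate reward vector, shape (S,)
      gamma     -- discount factor in (0, 1)
    Returns the new alpha vector for b and the greedy action it corresponds to.
    """
    num_actions = len(T)
    num_obs = O[0].shape[1]
    best_vec, best_val, best_action = None, -np.inf, None
    for a in range(num_actions):
        vec = R[a].astype(float)
        for o in range(num_obs):
            # g_{a,o}^alpha(s) = sum_{s2} T(s2|s,a) O(o|s2,a) alpha(s2)
            candidates = [T[a] @ (O[a][:, o] * alpha) for alpha in alpha_set]
            # keep the alpha vector that is best for this belief under (a, o)
            vec = vec + gamma * max(candidates, key=lambda g: float(b @ g))
        val = float(b @ vec)
        if val > best_val:
            best_vec, best_val, best_action = vec, val, a
    return best_vec, best_action
```

Point-based algorithms such as PBVI, Perseus, and HSVI differ mainly in which beliefs they collect and in the order in which this backup is applied to them, which is where the update-ordering and trial-based results above come in.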
