An Integrated Approach to Hierarchy and Abstraction for POMDPs

This paper presents an algorithm for planning in structured partially observable Markov decision processes (POMDPs). The new algorithm, named PolCA (for Policy-Contingent Abstraction), uses an action-based decomposition to partition complex POMDP problems into a hierarchy of smaller subproblems. Low-level subtasks are solved first, and their partial policies are used to model abstract actions in the context of higher-level subtasks. At all levels of the hierarchy, subtasks need only consider a reduced action, state, and observation space. The reduced action set is provided by a designer, whereas the reduced state and observation sets are discovered automatically on a subtask-by-subtask basis. This typically results in lower-level subtasks having few but high-resolution state/observation features, whereas high-level subtasks tend to have many but low-resolution state/observation features. This paper presents a detailed overview of PolCA in the context of a hierarchical POMDP planning and execution algorithm. It also includes theoretical results demonstrating that, in the special case of fully observable MDPs, the algorithm converges to a recursively optimal solution. Experimental results included in the paper demonstrate the usefulness of the approach on a range of problems and show favorable performance compared to competing function-approximation POMDP algorithms. Finally, the paper presents a real-world implementation and deployment of a robotic system that uses PolCA for a high-level robot behavior control task.
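As a rough illustration of the bottom-up scheme the abstract describes, the sketch below shows how subtask policies could be computed children-first so that each child's partial policy is available as an abstract action to its parent. The Subtask type, the abstraction and solver placeholders, and all function names are hypothetical stand-ins, not the paper's actual procedures.

```python
# Minimal sketch of PolCA-style bottom-up hierarchical solving.
# Assumptions: a hypothetical Subtask node carrying a designer-provided
# reduced action set; the state/observation abstraction and the POMDP
# solver are placeholders for the real per-subtask procedures.

from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Subtask:
    name: str
    actions: List[str]                     # reduced action set (designer-provided)
    children: List["Subtask"] = field(default_factory=list)


def abstract_state_obs(task: Subtask) -> Dict:
    """Placeholder: derive reduced state/observation sets for this subtask."""
    return {"states": "clustered", "observations": "clustered"}


def solve_pomdp(task: Subtask, model: Dict) -> Callable:
    """Placeholder: solve the reduced POMDP and return a (partial) policy."""
    return lambda belief: task.actions[0] if task.actions else None


def polca_solve(task: Subtask, policies: Dict[str, Callable]) -> Callable:
    """Solve children first; their partial policies act as abstract actions here."""
    for child in task.children:
        policies[child.name] = polca_solve(child, policies)
    model = abstract_state_obs(task)       # subtask-specific abstraction
    policies[task.name] = solve_pomdp(task, model)
    return policies[task.name]


# Usage: solve a two-level hierarchy bottom-up.
navigate = Subtask("navigate", ["move", "stop"])
root = Subtask("assist", ["navigate", "speak"], children=[navigate])
all_policies: Dict[str, Callable] = {}
polca_solve(root, all_policies)
```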
