Hierarchical Reinforcement Learning: Assignment of Behaviours to Subpolicies by Self-Organization

A new Hierarchical Reinforcement Learning algorithm called HABS (Hierarchical Assignment of Behaviours by Self-organization) is proposed in this thesis. HABS uses self-organization to assign behaviours to uncommitted subpolicies. Task decompositions are central in Hierarchical Reinforcement Learning, but in most approaches they need to be designed a priori, and the agent only needs to fill in the details of the fixed structure. In contrast, the algorithm presented here autonomously identifies behaviours in an abstract, higher-level state space. Subpolicies self-organize to specialize in the high-level behaviours that are actually needed, and these subpolicies are then used as the high-level actions.

HABS is a continuation of the HASSLE algorithm proposed by Bakker and Schmidhuber [1, 2]. HASSLE uses abstract states (called subgoals) both as its high-level states and as its high-level actions. Subpolicies specialize in transitions (i.e. high-level actions) between subgoals, and the mapping between transitions and subpolicies is learned. HASSLE is goal directed (towards its subgoals), and this has the undesired consequence that the number of higher-level actions (the transitions between subgoals) increases when the problem scales up. This action explosion is unfortunate because it slows down exploration and vastly increases memory usage. Furthermore, the goal-directed nature prevents HASSLE from using function approximators and from using more than two layers.

The proposed algorithm can be viewed as a short-circuited version of HASSLE. HABS is a solution to the problem that results from using subgoals as actions: it tries to map all the experienced (high-level) behaviours to a (small) set of subpolicies, which can then be used directly as high-level actions. This makes it suitable for using a neural network for its high-level policy, unlike many other Hierarchical Reinforcement Learning algorithms.
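To make the assignment idea concrete, the following is a minimal sketch of one high-level step in a HABS-like loop: a small, fixed pool of subpolicies holds behaviour prototypes, the high-level policy picks a subpolicy as its action, and the behaviour that is actually experienced is claimed by the nearest prototype, which is then pulled towards it (a simple winner-take-all, vector-quantizer-style update). The class and function names, the environment interface (env.features, env.run_subpolicy), and the update rule are illustrative assumptions, not the thesis's actual implementation or equations.

```python
import numpy as np

class SubpolicyPool:
    """A small, fixed pool of subpolicies whose behaviour prototypes
    self-organize (vector-quantizer style) over the behaviours the agent
    actually experiences. Hypothetical sketch, not the HABS implementation."""

    def __init__(self, n_subpolicies, behaviour_dim, lr=0.05):
        # One prototype per subpolicy, living in "behaviour space"
        # (here: the change in abstract-state features).
        self.prototypes = 0.1 * np.random.randn(n_subpolicies, behaviour_dim)
        self.lr = lr

    def assign(self, behaviour):
        # Winner-take-all: the closest prototype claims the behaviour.
        dists = np.linalg.norm(self.prototypes - behaviour, axis=1)
        return int(np.argmin(dists))

    def adapt(self, winner, behaviour):
        # Pull the winner towards the observed behaviour so it specializes.
        self.prototypes[winner] += self.lr * (behaviour - self.prototypes[winner])


def high_level_step(pool, q_table, env, s, epsilon=0.1):
    """One high-level decision in a HABS-like loop.

    Assumed (hypothetical) interface: abstract states are integer IDs
    (rows of `q_table`), `env.features(s)` returns a feature vector for
    abstract state `s`, and `env.run_subpolicy(i, s)` executes subpolicy
    `i` until it terminates and returns (next_abstract_state, reward).
    """
    n_sub = len(pool.prototypes)

    # Epsilon-greedy choice among the subpolicies, i.e. the high-level actions.
    if np.random.rand() < epsilon:
        choice = np.random.randint(n_sub)
    else:
        choice = int(np.argmax(q_table[s]))

    s_next, reward = env.run_subpolicy(choice, s)

    # Summarize the experienced behaviour as a change in abstract features,
    # then let the pool decide which subpolicy it belongs to and adapt it.
    behaviour = env.features(s_next) - env.features(s)
    winner = pool.assign(behaviour)
    pool.adapt(winner, behaviour)

    return choice, winner, s_next, reward
```

Because the number of subpolicies stays fixed and small, the high-level action set does not grow with the number of abstract states, which is the property that allows a neural network to be used for the high-level policy.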

[1] Leemon C. Baird et al. Residual advantage learning applied to a differential game, 1996, Proceedings of International Conference on Neural Networks (ICNN'96).

[2] Thomas G. Dietterich et al. Solving the Multiple Instance Problem with Axis-Parallel Rectangles, 1997, Artif. Intell.

[3] Stuart J. Russell et al. Reinforcement Learning with Hierarchies of Machines, 1997, NIPS.

[4] Andrew W. Moore et al. Reinforcement Learning: A Survey, 1996, J. Artif. Intell. Res.

[5] Alex M. Andrew et al. Reinforcement Learning: An Introduction, 1998.

[6] Amy McGovern et al. AcQuire-macros: An Algorithm for Automatically Learning Macro-actions, 1998.

[7] Bernhard Hengst et al. Discovering hierarchy in reinforcement learning, 2003.

[8] L. Baird. Reinforcement Learning Through Gradient Descent, 1999.

[9] Andrew G. Barto et al. Automatic Discovery of Subgoals in Reinforcement Learning using Diverse Density, 2001, ICML.

[10] Jürgen Schmidhuber et al. HQ-Learning, 1997, Adapt. Behav.

[11] Minoru Asada et al. Behavior acquisition by multi-layered reinforcement learning, 1999, IEEE SMC'99 Conference Proceedings. 1999 IEEE International Conference on Systems, Man, and Cybernetics (Cat. No.99CH37028).

[12] John E. Laird et al. Human-Level AI's Killer Application: Interactive Computer Games, 2000, AI Mag.

[13] Thomas G. Dietterich. Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition, 1999, J. Artif. Intell. Res.

[14] Doina Precup et al. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning, 1999, Artif. Intell.

[15] Martin A. Riedmiller et al. Speeding-up Reinforcement Learning with , 2001.

[16] Sridhar Mahadevan et al. Hierarchical multi-agent reinforcement learning, 2001, AGENTS '01.

[17] Lars Niklasson et al. Sensory Flow Segmentation Using a Resource Allocating Vector Quantizer, 2000, SSPR/SPR.

[18] Sridhar Mahadevan et al. Extending Hierarchical Reinforcement Learning to Continuous-Time, Average-Reward, and Multi-Agent Models, 2003.

[19] Rodney A. Brooks et al. A Robust Layered Control System For A Mobile Robot, 1986.

[20] Andrew W. Moore et al. Multi-Value-Functions: Efficient Automatic Action Hierarchies for Multiple Goal MDPs, 1999, IJCAI.

[21] Robert J. Serling. For Reasons Unknown, 2003.

[22] Bernhard Hengst et al. Concurrent Discovery of Task Hierarchies, 2004.

[23] Mark D. Pendrith et al. RL-TOPS: An Architecture for Modularity and Re-Use in Reinforcement Learning, 1998, ICML.

[24] Lars Niklasson et al. Time series segmentation using an adaptive resource allocating vector quantization network based on change detection, 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium.

[25] Manuela M. Veloso et al. Layered Learning, 2000, ECML.

[26] Geoffrey E. Hinton et al. Feudal Reinforcement Learning, 1992, NIPS.

[27] Richard S. Sutton et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[28] Shie Mannor et al. Q-Cut - Dynamic Discovery of Sub-goals in Reinforcement Learning, 2002, ECML.

[29] Mance E. Harmon et al. Reinforcement Learning: A Tutorial, 1997.

[30] Jürgen Schmidhuber et al. Hierarchical reinforcement learning with subpolicies specializing for learned subgoals, 2004, Neural Networks and Computational Intelligence.

[31] Marco Wiering et al. Hierarchical Assignment of Behaviours to Subpolicies, 2008.

[32] Bernhard Hengst et al. Discovering multiple levels of a task hierarchy concurrently, 2004, Robotics Auton. Syst.