Balancing Exploration and Exploitation in Hierarchical Reinforcement Learning via Latent Landmark Graphs

Goal-Conditioned Hierarchical Reinforcement Learning (GCHRL) is a promising paradigm for addressing the exploration-exploitation dilemma in reinforcement learning. It decomposes the source task into subgoal-conditioned subtasks and conducts exploration and exploitation in the subgoal space. The effectiveness of GCHRL heavily relies on the subgoal representation function and the subgoal selection strategy. However, existing works often overlook temporal coherence when learning latent subgoal representations and lack an efficient subgoal selection strategy that balances exploration and exploitation. This paper proposes HIerarchical reinforcement learning via dynamically building Latent Landmark graphs (HILL) to overcome these limitations. HILL learns latent subgoal representations that satisfy temporal coherence using a contrastive representation learning objective. Based on these representations, HILL dynamically builds latent landmark graphs and employs a novelty measure on nodes and a utility measure on edges. Finally, HILL develops a subgoal selection strategy that balances exploration and exploitation by jointly considering both measures. Experimental results demonstrate that HILL outperforms state-of-the-art baselines on continuous control tasks with sparse rewards in both sample efficiency and asymptotic performance. Our code is available at https://github.com/papercode2022/HILL.
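
To make the representation-learning idea concrete, the following is a minimal PyTorch sketch of a contrastive objective that enforces temporal coherence: latent codes of temporally adjacent states are pulled together, while states from unrelated trajectories are pushed apart. The encoder architecture, the triplet-style formulation, the margin, and the sampling scheme are illustrative assumptions, not the exact objective used by HILL.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SubgoalEncoder(nn.Module):
    """Maps raw states to a low-dimensional latent subgoal space (sketch)."""

    def __init__(self, state_dim: int, latent_dim: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)


def temporal_coherence_loss(encoder, s_t, s_tk, s_neg, margin: float = 1.0):
    """Triplet-style contrastive loss (an assumed stand-in for the paper's objective).

    s_t and s_tk come from the same trajectory, a few steps apart (positive pair);
    s_neg is sampled from unrelated trajectories (negative).
    """
    z_t, z_tk, z_neg = encoder(s_t), encoder(s_tk), encoder(s_neg)
    pos = F.mse_loss(z_t, z_tk)  # keep temporally adjacent states close
    neg = F.relu(margin - (z_t - z_neg).pow(2).sum(-1)).mean()  # push negatives apart
    return pos + neg
```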

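The landmark-graph construction and subgoal selection can be sketched in a similar spirit: cluster recent latent states into landmark nodes, score nodes with a novelty measure and edges with a utility measure, and pick the landmark that maximizes a combination of both. Clustering with k-means++, using inverse visitation counts as the novelty proxy, and mixing the two scores with a weight `alpha` are illustrative choices; the names `build_landmark_graph`, `select_subgoal`, and `utility_fn` are hypothetical and not taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans


def build_landmark_graph(latent_states: np.ndarray, n_landmarks: int = 8):
    """Cluster latent states into landmark nodes and count visits per node."""
    km = KMeans(n_clusters=n_landmarks, init="k-means++", n_init=10)
    labels = km.fit_predict(latent_states)
    visit_counts = np.bincount(labels, minlength=n_landmarks)
    return km.cluster_centers_, visit_counts


def select_subgoal(landmarks, visit_counts, utility_fn, z_current, alpha: float = 0.5):
    """Pick the landmark that trades off node novelty against edge utility.

    utility_fn(z_current, landmark) is an assumed callable (e.g., a learned value
    estimate for reaching the landmark); the linear weighting below is illustrative
    and assumes both scores are on comparable scales.
    """
    novelty = 1.0 / np.sqrt(visit_counts + 1)  # count-based novelty proxy on nodes
    utility = np.array([utility_fn(z_current, lm) for lm in landmarks])  # edge scores
    scores = alpha * novelty + (1.0 - alpha) * utility
    return landmarks[np.argmax(scores)]
```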