World Model as a Graph: Learning Latent Landmarks for Planning

Planning - the ability to analyze the structure of a problem in the large and decompose it into interrelated subproblems - is a hallmark of human intelligence. While deep reinforcement learning (RL) has shown great promise for solving relatively straightforward control tasks, how best to incorporate planning into existing deep RL paradigms to handle increasingly complex environments remains an open problem. One prominent framework, model-based RL, learns a world model and plans with step-by-step virtual rollouts. Such world models quickly diverge from reality as the planning horizon grows, and therefore struggle at long-horizon planning. How can we learn world models that endow agents with the ability to perform temporally extended reasoning? In this work, we propose to learn graph-structured world models composed of sparse, multi-step transitions. We devise a novel algorithm that learns latent landmarks scattered (in terms of reachability) across the goal space to serve as the nodes of the graph; the edges of the graph are reachability estimates distilled from Q-functions. On a variety of high-dimensional continuous control tasks ranging from robotic manipulation to navigation, we demonstrate that our method, named L3P, significantly outperforms prior work and is often the only method capable of leveraging both the robustness of model-free RL and the generalization of graph-search algorithms. We believe our work is an important step towards scalable planning in reinforcement learning.
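To make the graph-based planning idea concrete, below is a minimal sketch of how such a landmark graph could be assembled and searched. It is illustrative only and not the paper's implementation: the landmarks here are picked with plain k-means over visited goal embeddings (the paper learns them jointly in a latent space), and the `reach_steps` callable stands in for the reachability estimates that would be distilled from a trained goal-conditioned Q-function. All function names (`kmeans_landmarks`, `build_graph`, `dijkstra`) are hypothetical.

```python
import heapq
import numpy as np

def kmeans_landmarks(goal_embeddings, k, iters=50, seed=0):
    """Pick k landmark centroids from visited goal embeddings.
    Plain k-means stand-in; the paper learns latent landmarks jointly
    with the goal embedding and reachability objective."""
    rng = np.random.default_rng(seed)
    centers = goal_embeddings[rng.choice(len(goal_embeddings), k, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest landmark, then recenter.
        d = np.linalg.norm(goal_embeddings[:, None] - centers[None], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = goal_embeddings[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def build_graph(landmarks, reach_steps, max_edge=10.0):
    """Edges are pairwise reachability estimates; edges longer than
    max_edge are pruned so each hop stays within a horizon the
    low-level policy can reliably execute."""
    k = len(landmarks)
    graph = {i: [] for i in range(k)}
    for i in range(k):
        for j in range(k):
            if i != j:
                cost = reach_steps(landmarks[i], landmarks[j])
                if cost <= max_edge:
                    graph[i].append((j, cost))
    return graph

def dijkstra(graph, start, goal):
    """Shortest landmark sequence from start node to goal node."""
    dist, prev = {start: 0.0}, {}
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, np.inf):
            continue
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, np.inf):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    if goal not in dist:
        return None
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]

if __name__ == "__main__":
    # Toy 2D goal space; Euclidean distance stands in for the
    # Q-derived reachability estimate a trained agent would provide.
    rng = np.random.default_rng(0)
    visited_goals = rng.uniform(0, 20, size=(500, 2))
    reach_steps = lambda a, b: float(np.linalg.norm(a - b))

    landmarks = kmeans_landmarks(visited_goals, k=15)
    graph = build_graph(landmarks, reach_steps, max_edge=8.0)
    print("landmark path:", dijkstra(graph, start=0, goal=7))
```

In use, the planner would follow the returned landmark sequence by handing each landmark in turn to a goal-conditioned policy, replanning as landmarks are reached; the edge-pruning threshold is what keeps each individual hop short enough for the model-free policy to remain reliable.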
