Sparse Graphical Memory for Robust Planning

To operate effectively in the real world, agents should be able to act from high-dimensional raw sensory input such as images and achieve diverse goals across long time horizons. Current deep reinforcement and imitation learning methods can learn directly from high-dimensional inputs but do not scale well to long-horizon tasks. In contrast, classical graphical methods like A* search can solve long-horizon tasks but assume that the state space is abstracted away from raw sensory input. Recent works have attempted to combine the strengths of deep learning and classical planning; however, dominant methods in this domain remain brittle and scale poorly with the size of the environment. We introduce Sparse Graphical Memory (SGM), a new data structure that stores states and feasible transitions in a sparse memory. SGM aggregates states according to a novel two-way consistency objective, adapting classic state-aggregation criteria to goal-conditioned RL: two states are redundant when they are interchangeable both as goals and as starting states. Theoretically, we prove that merging nodes according to two-way consistency leads to an increase in shortest-path lengths that scales only linearly with the merging threshold. Experimentally, we show that SGM significantly outperforms current state-of-the-art methods on long-horizon, sparse-reward visual navigation tasks. Project video and code are available at this https URL.
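
For concreteness, the aggregation rule admits a short sketch. The Python snippet below is a minimal illustration, assuming a learned goal-conditioned distance estimate dist(s, g) (in the paper this role is played by a goal-conditioned value function); the names dist, tau, nodes, and SparseGraphicalMemory are illustrative, not the authors' API. Two states are merged only when they agree, up to a threshold tau, on distances both outgoing (as starting states) and incoming (as goals):

    # Hedged sketch of two-way consistency aggregation. dist(s, g) is an
    # assumed learned distance estimate; all names here are illustrative.
    from typing import Callable, Hashable, List

    State = Hashable
    DistFn = Callable[[State, State], float]

    def two_way_consistent(dist: DistFn, s1: State, s2: State,
                           nodes: List[State], tau: float) -> bool:
        """True if s1 and s2 are interchangeable, up to tolerance tau,
        both as starting states (outgoing) and as goals (incoming),
        with respect to every node currently stored in the memory."""
        out_gap = max((abs(dist(s1, w) - dist(s2, w)) for w in nodes),
                      default=0.0)
        in_gap = max((abs(dist(w, s1) - dist(w, s2)) for w in nodes),
                     default=0.0)
        return out_gap <= tau and in_gap <= tau

    class SparseGraphicalMemory:
        """Minimal sparse memory: a state is stored only when no
        existing node is two-way consistent with it."""
        def __init__(self, dist: DistFn, tau: float):
            self.dist, self.tau = dist, tau
            self.nodes: List[State] = []

        def insert(self, s: State) -> State:
            for v in self.nodes:
                if two_way_consistent(self.dist, s, v,
                                      self.nodes, self.tau):
                    return v      # s is redundant; reuse the existing node
            self.nodes.append(s)  # s is novel; add it to the sparse memory
            return s

Because insertion keeps only states that no existing node can stand in for, the memory stays sparse as experience accumulates, which is what keeps downstream shortest-path search over the graph (e.g., Dijkstra or A*) tractable.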
