Search on the Replay Buffer: Bridging Planning and Reinforcement Learning

The history of learning for control has been an exciting back and forth between two broad classes of algorithms: planning and reinforcement learning. Planning algorithms effectively reason over long horizons, but assume access to a local policy and distance metric over collision-free paths. Reinforcement learning excels at learning policies and the relative values of states, but fails to plan over long horizons. Despite the successes of each method in various domains, tasks that require reasoning over long horizons with limited feedback and high-dimensional observations remain exceedingly challenging for both planning and reinforcement learning algorithms. Frustratingly, these sorts of tasks are potentially the most useful, as they are simple to design (a human need only provide an example goal state) and avoid reward shaping, which can bias the agent towards finding a sub-optimal solution. We introduce a general control algorithm that combines the strengths of planning and reinforcement learning to effectively solve these tasks. Our aim is to decompose the task of reaching a distant goal state into a sequence of easier tasks, each of which corresponds to reaching a subgoal. Planning algorithms can automatically find these waypoints, but only if provided with a suitable abstraction of the environment -- namely, a graph consisting of nodes and edges. Our main insight is that this graph can be constructed via reinforcement learning, where a goal-conditioned value function provides edge weights, and nodes are taken to be previously seen observations in a replay buffer. Using graph search over our replay buffer, we can automatically generate this sequence of subgoals, even in image-based environments. Our algorithm, search on the replay buffer (SoRB), enables agents to solve sparse reward tasks over one hundred steps, and generalizes substantially better than standard RL algorithms.
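To make the graph-search step concrete, the following is a minimal Python sketch of the idea described above: nodes are observations drawn from the replay buffer, edge weights come from a learned distance estimate (for example, one recovered from a goal-conditioned value function), and a shortest-path search over that graph yields the subgoal sequence. This is a sketch under stated assumptions, not the paper's implementation; the names plan_subgoals, distance_fn, and max_edge_dist are illustrative, and any shortest-path routine could stand in for the networkx call.

```python
import networkx as nx  # graph search; any Dijkstra implementation would do


def plan_subgoals(buffer_obs, start, goal, distance_fn, max_edge_dist=10.0):
    """Sketch of search on the replay buffer (all names are illustrative).

    buffer_obs    -- previously seen observations; these become the graph nodes.
    distance_fn   -- distance_fn(a, b) estimates the number of steps needed to
                     reach b from a, e.g. derived from a goal-conditioned value
                     function (hypothetical signature).
    max_edge_dist -- keep only edges whose estimated length is short enough for
                     the learned distance to be trusted.
    """
    nodes = [start] + list(buffer_obs) + [goal]
    graph = nx.DiGraph()
    graph.add_nodes_from(range(len(nodes)))

    # Connect every ordered pair of nodes whose predicted distance is small.
    for i in range(len(nodes)):
        for j in range(len(nodes)):
            if i == j:
                continue
            d = float(distance_fn(nodes[i], nodes[j]))
            if d < max_edge_dist:
                graph.add_edge(i, j, weight=d)

    # Shortest path from the current state (index 0) to the goal (last index).
    path = nx.shortest_path(graph, source=0, target=len(nodes) - 1, weight="weight")

    # The intermediate nodes along the path are the subgoals handed to the policy.
    return [nodes[i] for i in path[1:-1]]
```

In this sketch, the threshold on edge length reflects the fact that learned distance estimates are typically reliable only for nearby states; the goal-conditioned policy would then be directed toward the first returned subgoal, with re-planning as the agent progresses.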
