Temporal Difference Models: Model-Free Deep RL for Model-Based Control

Model-free reinforcement learning (RL) is a powerful, general tool for learning complex behaviors. However, it often requires an impractically large number of samples to solve challenging real-world problems, even with off-policy algorithms such as Q-learning. A limiting factor in classic model-free RL is that the learning signal consists only of scalar rewards, ignoring much of the rich information contained in state transition tuples. Model-based RL exploits this information by training a predictive model, but it often does not achieve the same asymptotic performance as model-free RL due to model bias. We introduce temporal difference models (TDMs), a family of goal-conditioned value functions that can be trained with model-free learning and used for model-based control. TDMs combine the benefits of model-free and model-based RL: they leverage the rich information in state transitions to learn very efficiently, while still attaining asymptotic performance that exceeds that of direct model-based RL methods. Our experimental results show that, on a range of continuous control tasks, TDMs provide a substantial improvement in sample efficiency compared to state-of-the-art model-based and model-free methods.
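To make the goal-conditioned idea concrete, the sketch below shows one way a TDM-style temporal-difference target could be computed. The parameterization Q(s, a, g, τ) and the negative-distance reward at horizon τ = 0 are assumptions inferred from the goal-conditioned setup summarized above, not the authors' exact implementation; the names `tdm_target`, `q_fn`, and `candidate_actions` are hypothetical.

```python
import numpy as np

def tdm_target(q_fn, next_state, goal, tau, candidate_actions):
    """Illustrative TD target for a goal- and horizon-conditioned Q-function.

    Assumes q_fn(s, a, g, tau) approximates the (negative) distance to `goal`
    achievable within `tau` steps. At tau == 0 the target is simply the
    negative distance from the observed next state to the goal; otherwise it
    bootstraps from the best candidate action with one fewer step remaining.
    """
    if tau == 0:
        # Terminal horizon: the "reward" is how close the agent got to the goal.
        return -np.linalg.norm(next_state - goal)
    # Bootstrap from the goal-conditioned value at horizon tau - 1.
    return max(q_fn(next_state, a, goal, tau - 1) for a in candidate_actions)

# Example with a dummy Q-function and a couple of sampled actions.
dummy_q = lambda s, a, g, tau: -np.linalg.norm(s + a - g)
target = tdm_target(dummy_q, np.zeros(2), np.ones(2), tau=3,
                    candidate_actions=[np.array([0.1, 0.1]), np.array([0.5, 0.5])])
```

Under this reading, a trained TDM can be used for model-based control by treating -Q(s, a, g, τ) as a learned multi-step distance and optimizing actions (and intermediate goals) against it, which is consistent with the abstract's description of model-free training paired with model-based control.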
