Paper Information - Learning Continuous Control Policies by Stochastic Value Gradients


We present a unified framework for learning continuous control policies using backpropagation. It supports stochastic control by treating stochasticity in the Bellman equation as a deterministic function of exogenous noise. The product is a spectrum of general policy gradient algorithms that range from model-free methods with value functions to model-based methods without value functions. We use learned models but only require observations from the environment instead of observations from model-predicted trajectories, minimizing the impact of compounded model errors. We apply these algorithms first to a toy stochastic control problem and then to several physics-based control problems in simulation. One of these variants, SVG(1), shows the effectiveness of learning models, value functions, and policies simultaneously in continuous domains.
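The idea of treating stochasticity as a deterministic function of exogenous noise can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: a one-step scalar problem with a linear policy a = theta * s + sigma * eps, additive dynamics s' = s + a + noise_std * xi, and reward r = -s'^2. The function names `pathwise_gradient` and `analytic_gradient` are hypothetical. Because the noise variables eps and xi are sampled independently of theta, the reward is a differentiable function of theta for each fixed noise draw, so the gradient of the expected reward can be estimated by averaging per-sample derivatives.

```python
import numpy as np


def pathwise_gradient(theta, s, sigma=0.5, noise_std=0.3,
                      n_samples=200_000, seed=0):
    """Monte Carlo estimate of d E[r] / d theta for a toy one-step problem.

    Assumed toy setup (not from the paper):
      action      a  = theta * s + sigma * eps,      eps ~ N(0, 1)
      next state  s' = s + a + noise_std * xi,       xi  ~ N(0, 1)
      reward      r  = -(s')**2
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)  # exogenous policy noise
    xi = rng.standard_normal(n_samples)   # exogenous dynamics noise

    # With the noise held fixed, a and s' are deterministic in theta.
    a = theta * s + sigma * eps
    s_next = s + a + noise_std * xi

    # Chain rule: dr/dtheta = (dr/ds') * (ds'/da) * (da/dtheta)
    dr_ds_next = -2.0 * s_next
    ds_next_da = 1.0
    da_dtheta = s
    grad_samples = dr_ds_next * ds_next_da * da_dtheta
    return grad_samples.mean()


def analytic_gradient(theta, s):
    """Closed form for the same toy problem: E[r] = -((1+theta)^2 s^2 + const)."""
    return -2.0 * (1.0 + theta) * s ** 2


if __name__ == "__main__":
    theta, s = 0.2, 1.5
    print("pathwise estimate:", pathwise_gradient(theta, s))
    print("analytic value:   ", analytic_gradient(theta, s))
```

In this sketch the derivatives are written out by hand; in a setting with learned models and neural-network policies the same chain rule would be carried out by backpropagation.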
