Compatible Value Gradients for Reinforcement Learning of Continuous Deep Policies

This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. First, we present a temporal-difference-based method for learning the gradient of the value function. Second, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, estimate its gradient, and determine the actor's policy, respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets, designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a difficult reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
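To make the deviator-actor-critic architecture concrete, the following is a minimal sketch of how the three networks and a single actor update could be wired together in PyTorch. The network shapes, hidden sizes, and the update step shown are illustrative assumptions, not the paper's exact setup; the compatible function approximation conditions and the temporal-difference updates for the critic and deviator are not reproduced here.

```python
# Minimal sketch of a deviator-actor-critic (DAC) setup (illustrative assumptions only).
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    # Small two-layer network; sizes are arbitrary for illustration.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

state_dim, action_dim = 8, 2

actor    = mlp(state_dim, action_dim)                 # policy: maps state to continuous action
critic   = mlp(state_dim + action_dim, 1)             # estimates the value of (state, action); trained by TD (not shown)
deviator = mlp(state_dim + action_dim, action_dim)    # estimates the gradient of the value w.r.t. the action (not shown: its training)

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

# One illustrative actor update: backpropagate the deviator's gradient estimate
# through the actor instead of differentiating the critic directly.
state  = torch.randn(32, state_dim)                   # dummy batch of states
action = actor(state)
grad_estimate = deviator(torch.cat([state, action.detach()], dim=1)).detach()

actor_opt.zero_grad()
# Vector-Jacobian product: accumulates (d action / d theta)^T * grad_estimate, summed over the batch.
# The minus sign turns gradient ascent on the value into a minimization step for the optimizer.
action.backward(gradient=-grad_estimate)
actor_opt.step()
```

The key design point this sketch illustrates is that the actor never differentiates through the critic: the deviator's output is treated as the value gradient and is fed directly into backpropagation through the actor network.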
