Forward Actor-Critic for Nonlinear Function Approximation in Reinforcement Learning

Multi-step methods are important in reinforcement learning (RL). Eligibility traces, the usual way of implementing them, work well with linear function approximators. Recently, van Seijen (2016) introduced a delayed-update approach, without eligibility traces, for handling the multi-step λ-return with nonlinear function approximators; however, it was limited to action-value methods. In this paper, we extend this approach to handle n-step returns, generalize it to policy-gradient methods, and empirically study the effect of such delayed updates in control tasks. Specifically, we introduce two novel forward actor-critic methods and compare them empirically with the conventional actor-critic method on mountain-car and pole-balancing tasks. In our experiments, the forward actor-critic dramatically outperforms the conventional actor-critic on these standard control tasks. Notably, the forward actor-critic gives rise to a new class of multi-step RL algorithms that require no eligibility traces.
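
The abstract describes the delayed, trace-free multi-step update only at a high level, so a concrete illustration may help. Below is a minimal sketch of an n-step forward actor-critic with a linear critic and a softmax policy. The class and method names, the linear/softmax choices, and the exact update form are assumptions made for exposition based on the abstract and on van Seijen (2016) [17]; this is not the authors' implementation, which targets nonlinear approximators and the λ-return.

```python
import numpy as np

# Illustrative sketch only: names and the linear/softmax choices are
# assumptions for exposition, not the authors' code.

class ForwardActorCritic:
    """Minimal n-step 'forward' actor-critic with delayed updates.

    Rather than propagating credit backward with eligibility traces,
    the update for the state visited at time t is delayed until t + n,
    when the n-step return
        G_t = r_{t+1} + gamma*r_{t+2} + ... + gamma^{n-1}*r_{t+n}
              + gamma^n * V(s_{t+n})
    can be computed directly from stored transitions.
    """

    def __init__(self, n_features, n_actions, n=4, gamma=0.99,
                 alpha_v=0.1, alpha_pi=0.01):
        self.n, self.gamma = n, gamma
        self.alpha_v, self.alpha_pi = alpha_v, alpha_pi
        self.w = np.zeros(n_features)                    # critic weights
        self.theta = np.zeros((n_actions, n_features))   # actor weights
        self.pending = []                                # delayed transitions

    def value(self, x):
        return self.w @ x

    def policy(self, x):
        prefs = self.theta @ x
        prefs -= prefs.max()                             # numerical stability
        probs = np.exp(prefs)
        return probs / probs.sum()

    def act(self, x):
        return np.random.choice(self.theta.shape[0], p=self.policy(x))

    def step(self, x, a, r, x_next, done):
        """Record the transition; release the update that is n steps old."""
        self.pending.append((x, a, r))
        if done:
            while self.pending:                          # flush at episode end
                self._update(bootstrap=None)
        elif len(self.pending) == self.n:
            self._update(bootstrap=x_next)

    def _update(self, bootstrap):
        x_t, a_t, _ = self.pending[0]
        # n-step return accumulated from the buffered rewards.
        G = sum(self.gamma ** k * r
                for k, (_, _, r) in enumerate(self.pending))
        if bootstrap is not None:
            G += self.gamma ** len(self.pending) * self.value(bootstrap)
        delta = G - self.value(x_t)                      # multi-step TD error
        self.w += self.alpha_v * delta * x_t             # critic update
        probs = self.policy(x_t)                         # actor update:
        grad = -np.outer(probs, x_t)                     # grad of log-softmax
        grad[a_t] += x_t
        self.theta += self.alpha_pi * delta * grad
        self.pending.pop(0)
```

The point to notice is that no eligibility-trace vector is maintained: multi-step credit assignment costs only an n-element buffer and an n-step delay before each update, which is what makes this forward view straightforward to combine with nonlinear function approximators.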

[1] Richard S. Sutton, et al. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 1983.

[2] Richard S. Sutton. Temporal credit assignment in reinforcement learning, 1984.

[3] C. Watkins. Learning from delayed rewards, 1989.

[4] Pawel Cichosz. Truncating Temporal Differences: On the Efficient Implementation of TD(λ) for Reinforcement Learning. Journal of Artificial Intelligence Research, 1994.

[5] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Two Volume Set, 1995.

[6] Pawel Cichosz. Truncating Temporal Differences: On the Efficient Implementation of TD(λ) for Reinforcement Learning, 1995.

[7] David K. Smith, et al. Dynamic Programming and Optimal Control, Volume 1, 1996.

[8] Richard S. Sutton, et al. Introduction to Reinforcement Learning, 1998.

[9] Richard S. Sutton, et al. Reinforcement Learning: An Introduction. IEEE Transactions on Neural Networks, 1998.

[10] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 1988.

[11] Csaba Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 2010.

[12] Guy Lever, et al. Deterministic Policy Gradient Algorithms. ICML, 2014.

[13] Yuval Tassa, et al. Continuous control with deep reinforcement learning. ICLR, 2015.

[14] Sergey Levine, et al. Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization. ICML, 2016.

[15] Pieter Abbeel, et al. Benchmarking Deep Reinforcement Learning for Continuous Control. ICML, 2016.

[16] Sergey Levine, et al. Learning deep neural network policies with continuous memory states. 2016 IEEE International Conference on Robotics and Automation (ICRA), 2016.

[17] Harm van Seijen. Effective Multi-step Temporal-Difference Learning for Non-Linear Function Approximation. arXiv, 2016.

[18] Alex Graves, et al. Asynchronous Methods for Deep Reinforcement Learning. ICML, 2016.