Neural Value Function Approximation in Continuous State Reinforcement Learning Problems

Recent developments in Deep Reinforcement Learning (DRL) have demonstrated the superior performance of neural networks in solving challenging problems with large or continuous state spaces. In this work, we focus on the problem of minimising the expected one-step Temporal Difference (TD) error with a neural function approximator over a continuous state space, from a smooth optimisation perspective. We propose an approximate Newton's algorithm and demonstrate its effectiveness on both finite and continuous state space benchmarks. We show that, in order to benefit from the second-order approximate Newton's algorithm, the gradient of the TD target needs to be taken into account during training.
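The distinction the abstract draws, between ignoring and including the gradient of the TD target, can be illustrated with a minimal sketch. For clarity this uses a linear value function V(s) = w·φ(s) rather than a neural network; the function and variable names are illustrative, not from the paper. The semi-gradient update treats the bootstrapped target as a constant, while the full (residual-style) gradient of the squared one-step TD error also differentiates through the target:

```python
import numpy as np

def td_updates(w, phi_s, phi_s_next, r, gamma):
    """Illustrative sketch: gradients of the squared one-step TD error
    0.5 * delta^2 for a linear value function V(s) = w . phi(s).

    Returns (semi_grad, full_grad), where semi_grad treats the TD target
    r + gamma * V(s') as a constant, and full_grad also differentiates
    through the target, as required to exploit second-order methods.
    """
    delta = r + gamma * np.dot(w, phi_s_next) - np.dot(w, phi_s)  # TD error
    semi_grad = -delta * phi_s                         # target held fixed
    full_grad = delta * (gamma * phi_s_next - phi_s)   # target's gradient included
    return semi_grad, full_grad
```

With a neural approximator the same split applies, with φ(s) replaced by the network's parameter gradient ∇_θ V_θ(s); the extra term γ ∇_θ V_θ(s') is exactly the contribution of the TD target.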
