Two-Timescale Networks for Nonlinear Value Function Approximation

A key component of many reinforcement learning agents is learning a value function, either for policy evaluation or control. Many of the algorithms for learning values, however, are designed for linear function approximation, with a fixed basis or fixed representation. Though there have been a few sound extensions to nonlinear function approximation, such as nonlinear gradient temporal difference learning, these methods have largely not been adopted, eschewed in favour of simpler but unsound methods like temporal difference learning and Q-learning. In this work, we provide a two-timescale network (TTN) architecture that enables linear methods to be used to learn values, with a nonlinear representation learned at a slower timescale. The approach facilitates the use of algorithms developed for the linear setting, such as data-efficient least-squares methods, eligibility traces and the myriad of recently developed linear policy evaluation algorithms, to provide nonlinear value estimates. We prove convergence for TTNs, with particular care taken to ensure convergence of the fast linear component under the potentially dependent features provided by the learned representation. We empirically demonstrate the benefits of TTNs over other nonlinear value function approximation algorithms, both for policy evaluation and control.
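
The following is a minimal sketch of the two-timescale idea described above, not the paper's exact algorithm: it assumes a one-hidden-layer tanh network whose features are shaped by a slow semi-gradient TD surrogate (with its own auxiliary head), while a separate linear TD(0) learner with a larger step size estimates the value on those features. The environment, network size, step sizes and surrogate loss are all illustrative choices; the paper also considers other linear learners (e.g., least-squares methods) on the fast timescale.

    import numpy as np

    rng = np.random.default_rng(0)

    STATE_DIM, HIDDEN = 4, 32
    GAMMA = 0.99
    ALPHA_SLOW = 1e-4   # representation (slow) step size
    ALPHA_FAST = 1e-2   # linear value (fast) step size

    # Feature-network parameters (slow timescale) and an auxiliary linear head
    # used only to form the surrogate loss that shapes the representation.
    W1 = rng.normal(scale=0.1, size=(HIDDEN, STATE_DIM))
    b1 = np.zeros(HIDDEN)
    w_surrogate = np.zeros(HIDDEN)

    # Linear value weights (fast timescale), learned on top of the features.
    w_value = np.zeros(HIDDEN)

    def features(s):
        """Nonlinear representation phi(s) produced by the slow network."""
        return np.tanh(W1 @ s + b1)

    def td_step(s, r, s_next, done):
        global W1, b1, w_surrogate, w_value
        phi, phi_next = features(s), features(s_next)

        # Fast timescale: linear TD(0) treating the features as fixed.
        target = r + (0.0 if done else GAMMA * (w_value @ phi_next))
        delta_fast = target - w_value @ phi
        w_value += ALPHA_FAST * delta_fast * phi

        # Slow timescale: semi-gradient TD surrogate through the network.
        target_surr = r + (0.0 if done else GAMMA * (w_surrogate @ phi_next))
        delta_slow = target_surr - w_surrogate @ phi
        grad_phi = 1.0 - phi ** 2                      # derivative of tanh
        W1 += ALPHA_SLOW * delta_slow * np.outer(w_surrogate * grad_phi, s)
        b1 += ALPHA_SLOW * delta_slow * (w_surrogate * grad_phi)
        w_surrogate += ALPHA_SLOW * delta_slow * phi

    # Toy usage: policy evaluation on synthetic random-walk transitions.
    s = rng.normal(size=STATE_DIM)
    for t in range(10_000):
        s_next = 0.9 * s + 0.1 * rng.normal(size=STATE_DIM)
        r = float(s[0])
        td_step(s, r, s_next, done=False)
        s = s_next

    print("value estimate at last state:", w_value @ features(s))

The key design choice mirrored here is the separation of timescales: because ALPHA_SLOW is much smaller than ALPHA_FAST, the features drift slowly relative to the linear weights, which is what lets the fast component be analyzed (and replaced) as an ordinary linear policy evaluation algorithm.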
