Adaptive bases for Q-learning

We consider reinforcement learning, and in particular the Q-learning algorithm, in large state and action spaces. To cope with the size of these spaces, the state-action value function must be represented by function approximation. We generalize the classical Q-learning algorithm to one in which the basis of the linear function approximation changes dynamically while the agent interacts with the environment. The motivation for this approach is to improve how well the approximate state-action value function fits the problem at hand, and thereby to obtain better performance. The algorithm is shown to converge using two-timescale stochastic approximation. Finally, we discuss how this technique can be applied to a rich family of RL algorithms with linear function approximation.
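To make the two-timescale idea concrete, the following is a minimal sketch under stated assumptions, not the paper's algorithm: Q-learning with a linear approximator built on Gaussian radial basis functions of a scalar state, where the RBF centers play the role of the adaptable basis parameters. The linear weights are updated on the fast timescale with the usual TD error, while the centers are updated on the slow timescale by a semi-gradient step on the squared TD error. The environment interface env_reset() / env_step(s, a) -> (s_next, r, done) and all names below are illustrative assumptions, not taken from the abstract.

import numpy as np

def rbf_features(s, a, centers, n_actions, width=1.0):
    # Gaussian RBF activations of the (scalar) state s, placed in the
    # feature block that corresponds to action a.
    acts = np.exp(-((s - centers) ** 2) / (2.0 * width ** 2))
    phi = np.zeros(len(centers) * n_actions)
    phi[a * len(centers):(a + 1) * len(centers)] = acts
    return phi

def adaptive_basis_q_learning(env_reset, env_step, n_actions, n_centers=10,
                              episodes=200, gamma=0.95, alpha=0.1, beta=0.001,
                              eps=0.1, width=1.0):
    # Fast timescale (step size alpha): ordinary linear Q-learning update of w.
    # Slow timescale (step size beta << alpha): semi-gradient adaptation of the
    # basis parameters (the RBF centers) to reduce the squared TD error.
    centers = np.linspace(0.0, 1.0, n_centers)   # adaptable basis parameters
    w = np.zeros(n_centers * n_actions)          # linear weights
    for _ in range(episodes):
        s, done = env_reset(), False
        while not done:
            # epsilon-greedy action from the current approximate Q
            q = [w @ rbf_features(s, b, centers, n_actions, width)
                 for b in range(n_actions)]
            a = (np.random.randint(n_actions) if np.random.rand() < eps
                 else int(np.argmax(q)))
            s_next, r, done = env_step(s, a)
            phi = rbf_features(s, a, centers, n_actions, width)
            q_next = 0.0 if done else max(
                w @ rbf_features(s_next, b, centers, n_actions, width)
                for b in range(n_actions))
            delta = r + gamma * q_next - w @ phi          # TD error
            w = w + alpha * delta * phi                   # fast: weight update
            acts = phi[a * n_centers:(a + 1) * n_centers]
            grad = (w[a * n_centers:(a + 1) * n_centers] * acts
                    * (s - centers) / width ** 2)         # dQ(s,a)/d(centers)
            centers = centers + beta * delta * grad       # slow: basis update
            s = s_next
    return w, centers

In a two-timescale analysis the slow step size beta is taken to be asymptotically negligible relative to alpha, so the weights effectively track their equilibrium for the current (slowly drifting) basis; the sketch above mirrors that separation of step sizes.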
