Successive Over Relaxation Q-Learning

In a discounted reward Markov decision process (MDP), the objective is to find the optimal value function, i.e., the value function corresponding to an optimal policy. This problem reduces to solving a functional equation known as the Bellman equation, and a fixed-point iteration scheme known as value iteration is used to obtain the solution. In the literature, a successive over-relaxation (SOR)-based value iteration scheme has been proposed to speed up the computation of the optimal value function. The speed-up is achieved by constructing a modified Bellman equation that ensures faster convergence to the optimal value function. However, in many practical applications the model information is not known, and we resort to reinforcement learning (RL) algorithms to obtain an optimal policy and the optimal value function. One such popular algorithm is Q-learning. In this letter, we propose SOR Q-learning. We first derive a modified fixed-point iteration for SOR Q-values and then use stochastic approximation to obtain a learning algorithm that computes the optimal value function and an optimal policy. We then prove the almost sure convergence of SOR Q-learning to the SOR Q-values. Finally, through numerical experiments, we show that SOR Q-learning converges faster than the standard Q-learning algorithm.
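The abstract does not spell out the update rule, so the following is a minimal Python sketch of how an SOR-style Q-learning step could look next to the standard one. The exact form of the target and the choice of the relaxation parameter `omega` are assumptions made for illustration; in the letter they are derived from the modified (SOR) Bellman equation, and `omega = 1` should reduce to ordinary Q-learning.

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha, gamma):
    """Standard Q-learning: move Q(s, a) toward the sampled Bellman target."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sor_q_learning_step(Q, s, a, r, s_next, alpha, gamma, omega):
    """SOR-style sketch (assumed form): the target mixes the usual sampled
    Bellman backup with the greedy value at the current state, weighted by
    the relaxation factor omega. With omega = 1 this is standard Q-learning."""
    target = omega * (r + gamma * np.max(Q[s_next])) + (1.0 - omega) * np.max(Q[s])
    Q[s, a] += alpha * (target - Q[s, a])

# Hypothetical usage on a toy 2-state, 2-action problem with random samples,
# purely to show the interface; rewards and transitions here are placeholders.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q = np.zeros((2, 2))
    for _ in range(1000):
        s, a = rng.integers(2), rng.integers(2)
        r = rng.normal()            # placeholder reward sample
        s_next = rng.integers(2)    # placeholder next-state sample
        sor_q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.9, omega=1.2)
    print(Q)
```

A value `omega > 1` (over-relaxation) is what would be expected to accelerate convergence, but its admissible range depends on the MDP (in particular on self-transition probabilities), which the letter characterizes.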
