University of Groningen

Approximating two value functions instead of one: towards characterizing a new family of Deep Reinforcement Learning algorithms

Sabatelli et al.

This paper takes one step towards characterizing a new family of model-free Deep Reinforcement Learning (DRL) algorithms. The aim of these algorithms is to jointly learn an approximation of the state-value function (V), alongside an approximation of the state-action value function (Q). Our analysis starts with a thorough study of the Deep Quality-Value Learning (DQV) algorithm, a DRL algorithm which has been shown to outperform popular techniques such as Deep Q-Learning (DQN) and Double Deep Q-Learning (DDQN) [16]. To investigate why DQV's learning dynamics allow this algorithm to perform so well, we formulate a set of research questions that help us characterize a new family of DRL algorithms. Among our results, we present specific cases in which DQV's performance can be harmed and introduce a novel off-policy DRL algorithm, called DQV-Max, which can outperform DQV. We then study the behavior of the V and Q functions that are learned by DQV and DQV-Max and show that both algorithms might perform so well on several DRL test-beds because they are less prone to the overestimation bias of the Q function.
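
The abstract leaves the update rules implicit; as a point of reference, the following is a minimal sketch of the temporal-difference targets commonly associated with DQV [18] and with the DQV-Max variant discussed here. The placement of the target-network parameters ($\Phi^-$ for the V-network, $\theta^-$ for the Q-network) is an assumption of this sketch, not a detail stated in the abstract.

DQV bootstraps both approximators from the state-value estimate, using the single target
\[
  y_t = r_t + \gamma \, V(s_{t+1}; \Phi^-)
\]
as the regression target for both $V(s_t; \Phi)$ and $Q(s_t, a_t; \theta)$. DQV-Max instead trains the V-network on an off-policy target bootstrapped from the Q-network,
\[
  y_t^{V} = r_t + \gamma \, \max_{a} Q(s_{t+1}, a; \theta^-), \qquad
  y_t^{Q} = r_t + \gamma \, V(s_{t+1}; \Phi),
\]
so that the two approximators keep bootstrapping from one another while the update of V no longer depends on the action selected by the behavior policy.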

[1] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 1992.

[2] Justin A. Boyan and Andrew W. Moore. Generalization in Reinforcement Learning: Safely Approximating the Value Function. NIPS, 1994.

[3] Matteo Hessel, et al. Rainbow: Combining Improvements in Deep Reinforcement Learning. AAAI, 2017.

[4] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 2014.

[5] Wenhao Yu, et al. Supplementary material. 2015.

[6] Marco A. Wiering and Hado van Hasselt. The QV family compared to other reinforcement learning algorithms. IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2009.

[7] Peter Henderson, et al. Deep Reinforcement Learning that Matters. AAAI, 2017.

[8] Hado van Hasselt, Arthur Guez, and David Silver. Deep Reinforcement Learning with Double Q-Learning. AAAI, 2015.

[9] Joshua Achiam, et al. Towards Characterizing Divergence in Deep Q-Learning. arXiv, 2019.

[10] Rich Caruana. Multitask Learning. Encyclopedia of Machine Learning and Data Mining, 1998.

[11] Ziyu Wang, et al. Dueling Network Architectures for Deep Reinforcement Learning. ICML, 2015.

[12] Meire Fortunato, et al. Noisy Networks for Exploration. ICLR, 2017.

[13] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing Function Approximation Error in Actor-Critic Methods. ICML, 2018.

[14] Guigang Zhang, et al. Deep Learning. International Journal of Semantic Computing, 2016.

[15] Long-Ji Lin. Reinforcement learning for robots using neural networks. 1992.

[16] Hado van Hasselt. Double Q-learning. NIPS, 2010.

[17] Hado van Hasselt, et al. Deep Reinforcement Learning and the Deadly Triad. arXiv, 2018.

[18] Matthia Sabatelli, Gilles Louppe, Pierre Geurts, and Marco A. Wiering. Deep Quality-Value (DQV) Learning. BNAIC/BENELEARN, 2019.

[19] Gavin A. Rummery and Mahesan Niranjan. On-line Q-learning using connectionist systems. 1994.

[20] Vitchyr Pong, et al. Temporal Difference Models: Model-Free Deep RL for Model-Based Control. ICLR, 2018.

[21] Marco A. Wiering. QV(λ)-learning: A New On-policy Reinforcement Learning Algorithm. 2005.

[22] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[23] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 1988.

[24] Volodymyr Mnih, et al. Human-level control through deep reinforcement learning. Nature, 2015.

[25] Marc G. Bellemare, et al. The Arcade Learning Environment: An Evaluation Platform for General Agents. Journal of Artificial Intelligence Research, 2012.

[26] Sean R. Eddy. What is dynamic programming? Nature Biotechnology, 2004.