Constrained Deep Q-Learning Gradually Approaching Ordinary Q-Learning

A deep Q network (DQN) (Mnih et al., 2013), a representative deep reinforcement learning method, is an extension of Q-learning. In DQN, a Q function that expresses the action values of all states is approximated by a convolutional neural network, and an optimal policy is derived from the approximated Q function. To stabilize learning, DQN introduces a target network, which computes the target values and is updated by copying the Q function at regular intervals. Less frequent updates of the target network result in a more stable learning process; however, because target values are not propagated until the target network is updated, DQN usually requires a large number of samples. In this study, we propose Constrained DQN, which uses the difference between the outputs of the Q function and the target network as a constraint on the target value. Constrained DQN updates its parameters conservatively when this difference is large and aggressively when it is small. As learning progresses, the constraint is activated less often, so the update rule gradually approaches that of conventional Q-learning. We found that Constrained DQN converges with fewer training samples than DQN and that it is robust to changes in the update frequency of the target network and to the settings of a certain optimizer parameter. Although Constrained DQN alone does not outperform integrated or distributed methods, experimental results show that it can be used as an additional component of those methods.
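
The sketch below is one plausible reading of this constrained update, not the paper's exact formulation: the constraint is realized as a penalty term that is active only while the online Q-value deviates from the target network's estimate by more than a threshold. The threshold `epsilon`, the weight `penalty_weight`, and all function and variable names are illustrative assumptions rather than quantities taken from the paper.

```python
import torch
import torch.nn.functional as F


def constrained_dqn_loss(online_net, target_net, batch,
                         gamma=0.99, epsilon=1.0, penalty_weight=1.0):
    """Illustrative sketch of a constrained DQN loss (not the paper's exact loss)."""
    # batch: states, actions (int64), rewards, next states, done flags (float).
    s, a, r, s_next, done = batch

    # Online network's estimate of the chosen actions.
    q_sa = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Ordinary DQN target computed from the frozen target network.
        q_next = target_net(s_next).max(dim=1).values
        y = r + gamma * (1.0 - done) * q_next
        # Target network's estimate of the same state-action pairs.
        q_target_sa = target_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # Standard temporal-difference loss toward the ordinary target.
    td_loss = F.mse_loss(q_sa, y)

    # Constraint term: zero while |Q - Q_target| <= epsilon, so the update
    # reduces to ordinary Q-learning; once the gap exceeds epsilon it pulls
    # the online network back toward the target network (conservative update).
    violation = F.relu((q_sa - q_target_sa).abs() - epsilon)
    constraint_loss = (violation ** 2).mean()

    return td_loss + penalty_weight * constraint_loss
```

Minimizing such a loss with a standard optimizer, while periodically copying the online weights into the target network, reproduces the qualitative behavior described above: early in training the constraint term is frequently active, and it fades as the two networks come to agree.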

[1] Kenji Doya, et al. Theoretical Analysis of Efficiency and Robustness of Softmax and Gap-Increasing Operators in Reinforcement Learning, 2019, AISTATS.

[2] Rémi Munos, et al. Observe and Look Further: Achieving Consistent Performance on Atari, 2018, ArXiv.

[3] Kavosh Asadi, et al. DeepMellow: Removing the Need for a Target Network in Deep Q-Learning, 2019, IJCAI.

[4] John N. Tsitsiklis, et al. Analysis of temporal-difference learning with function approximation, 1996, NIPS.

[5] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.

[6] Tom Schaul, et al. Dueling Network Architectures for Deep Reinforcement Learning, 2015, ICML.

[7] Long-Ji Lin, et al. Reinforcement learning for robots using neural networks, 1992.

[8] Marc G. Bellemare, et al. Increasing the Action Gap: New Operators for Reinforcement Learning, 2015, AAAI.

[9] Alex Graves, et al. Asynchronous Methods for Deep Reinforcement Learning, 2016, ICML.

[10] Sergey Levine, et al. Reinforcement Learning with Deep Energy-Based Policies, 2017, ICML.

[11] Hado van Hasselt, et al. Double Q-learning, 2010, NIPS.

[12] Alex Graves, et al. Playing Atari with Deep Reinforcement Learning, 2013, ArXiv.

[13] Tom Schaul, et al. Rainbow: Combining Improvements in Deep Reinforcement Learning, 2017, AAAI.

[14] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[15] Marcin Andrychowicz, et al. Parameter Space Noise for Exploration, 2017, ICLR.

[16] Peter Dayan, et al. Q-learning, 1992, Machine Learning.

[17] Matteo Hessel, et al. When to use parametric models in reinforcement learning?, 2019, NeurIPS.

[18] Sergey Levine, et al. Learning Hand-Eye Coordination for Robotic Grasping with Large-Scale Data Collection, 2016, ISER.

[19] Demis Hassabis, et al. Mastering the game of Go without human knowledge, 2017, Nature.

[20] Kenji Doya, et al. From free energy to expected energy: Improving energy-based value function approximation in reinforcement learning, 2016, Neural Networks.

[21] Marcin Andrychowicz, et al. Hindsight Experience Replay, 2017, NIPS.

[22] Herke van Hoof, et al. Addressing Function Approximation Error in Actor-Critic Methods, 2018, ICML.

[23] Richard S. Sutton, et al. Understanding Multi-Step Deep Reinforcement Learning: A Systematic Study of the DQN Target, 2019, ArXiv.

[24] Martin A. Riedmiller. Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method, 2005, ECML.

[25] Leemon C. Baird, et al. Residual Algorithms: Reinforcement Learning with Function Approximation, 1995, ICML.

[26] Rémi Munos, et al. Recurrent Experience Replay in Distributed Reinforcement Learning, 2018, ICLR.

[27] Peter Dayan, et al. Technical Note: Q-Learning, 2004, Machine Learning.

[28] Alex Graves, et al. Generating Sequences With Recurrent Neural Networks, 2013, ArXiv.

[29] Pieter Abbeel, et al. Towards Characterizing Divergence in Deep Q-Learning, 2019, ArXiv.

[30] Sergey Levine, et al. PLATO: Policy learning using adaptive trajectory optimization, 2017, IEEE International Conference on Robotics and Automation (ICRA).

[31] David Silver, et al. Deep Reinforcement Learning with Double Q-Learning, 2015, AAAI.

[32] Nahum Shimkin, et al. Averaged-DQN: Variance Reduction and Stabilization for Deep Reinforcement Learning, 2016, ICML.

[33] Yuxi Li, et al. Deep Reinforcement Learning, 2018, Reinforcement Learning for Cyber-Physical Systems.

[34] Demis Hassabis, et al. Mastering the game of Go with deep neural networks and tree search, 2016, Nature.

[35] Takamitsu Matsubara, et al. Deep reinforcement learning with smooth policy update: Application to robotic cloth manipulation, 2019, Robotics Auton. Syst.

[36] Yang Liu, et al. Learning to Play in a Day: Faster Deep Reinforcement Learning by Optimality Tightening, 2016, ICLR.

[37] Yuval Tassa, et al. Continuous control with deep reinforcement learning, 2015, ICLR.

[38] Tom Schaul, et al. Prioritized Experience Replay, 2015, ICLR.

[39] Thommen George Karimpanal, et al. Experience Replay Using Transition Sequences, 2017, Front. Neurorobot.

[40] Sergey Levine, et al. Continuous Deep Q-Learning with Model-based Acceleration, 2016, ICML.

[41] Zhuoran Yang, et al. A Theoretical Analysis of Deep Q-Learning, 2019, L4DC.

[42] Anind K. Dey, et al. Maximum Entropy Inverse Reinforcement Learning, 2008, AAAI.

[43] Mykel J. Kochenderfer, et al. Weighted Double Q-learning, 2017, IJCAI.

[44] David Budden, et al. Distributed Prioritized Experience Replay, 2018, ICLR.

[45] Juan F. García. Unifying n-Step Temporal-Difference Action-Value Methods, 2019.

[46] Peter Stone, et al. TD Learning with Constrained Gradients, 2018.

[47] Shane Legg, et al. Noisy Networks for Exploration, 2017, ICLR.