Variation-resistant Q-learning: Controlling and Utilizing Estimation Bias in Reinforcement Learning for Better Performance

Q-learning is a reinforcement learning algorithm that suffers from overestimation bias because it learns optimal action values using a target that maximizes over uncertain action-value estimates. Although this overestimation bias is generally considered harmful, a recent study suggests that it can be either harmful or helpful depending on the reinforcement learning problem. In this paper, we propose a new Q-learning variant, called Variation-resistant Q-learning, that controls and utilizes estimation bias for better performance. First, we present the tabular version of the algorithm and mathematically prove its convergence. Second, we combine the algorithm with function approximation. Finally, we present empirical results from three experiments comparing the performance of Variation-resistant Q-learning, Q-learning, and Double Q-learning. The results show that Variation-resistant Q-learning can control and utilize estimation bias to achieve better performance in the experimental tasks.
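
The overestimation described above stems from the max operator in the standard Q-learning target: maximizing over noisy action-value estimates produces a target that is biased upward in expectation. The sketch below illustrates the baseline tabular updates of Q-learning and of Double Q-learning, the two algorithms the paper compares against; it does not reproduce the Variation-resistant variant itself, and the step size, discount factor, and table layout are illustrative assumptions.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Standard Q-learning update.

    The target maxes over the same (noisy) estimates in Q, which is the
    source of the overestimation bias discussed in the abstract.
    Q is assumed to be a 2-D array indexed as Q[state, action].
    """
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def double_q_learning_update(QA, QB, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Double Q-learning update (van Hasselt, 2010).

    Action selection and action evaluation use two independent tables,
    which removes the positive bias of the single-estimator target
    (and can instead introduce an underestimation bias).
    """
    if np.random.rand() < 0.5:
        a_star = np.argmax(QA[s_next])            # select action with table A
        target = r + gamma * QB[s_next, a_star]   # evaluate it with table B
        QA[s, a] += alpha * (target - QA[s, a])
    else:
        a_star = np.argmax(QB[s_next])            # select action with table B
        target = r + gamma * QA[s_next, a_star]   # evaluate it with table A
        QB[s, a] += alpha * (target - QB[s, a])
```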
