VA-learning as a more efficient alternative to Q-learning

In reinforcement learning, the advantage function is critical for policy improvement, yet it is typically extracted from a learned Q-function. A natural question arises: why not learn the advantage function directly? In this work, we introduce VA-learning, which directly learns the advantage function and the value function using bootstrapping, without explicit reference to a Q-function. VA-learning learns off-policy and enjoys theoretical guarantees similar to those of Q-learning. Thanks to the direct learning of the advantage and value functions, VA-learning improves sample efficiency over Q-learning, both in tabular implementations and in deep RL agents on Atari-57 games. We also identify a close connection between VA-learning and the dueling architecture, which partially explains why a simple architectural change to DQN agents tends to improve performance.
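To make the idea concrete, below is a minimal tabular sketch of learning V and A directly with bootstrapped TD errors, never forming an explicit Q-table. The specific update rule shown (a shared TD error applied to both tables, with the greedy action taken on the advantage table) is an illustrative assumption and may differ from the exact algorithm in the paper; the function name and hyperparameters are hypothetical.

```python
import numpy as np

def va_learning_update(V, A, x, a, r, y, done, alpha=0.1, gamma=0.99):
    """One hypothetical VA-learning step on a transition (x, a, r, y).

    V: array of shape (n_states,), A: array of shape (n_states, n_actions).
    This is a sketch under stated assumptions, not the paper's exact update.
    """
    # Bootstrap target built from V and A of the next state; the greedy
    # action is selected using the advantage table alone.
    target = r if done else r + gamma * (V[y] + np.max(A[y]))
    td_error = target - (V[x] + A[x, a])
    # Both tables are nudged by the same TD error: V absorbs the
    # state-dependent part, A the action-dependent residual.
    A[x, a] += alpha * td_error
    V[x] += alpha * td_error
    return td_error

# Toy usage: 5 states, 2 actions.
n_states, n_actions = 5, 2
V = np.zeros(n_states)
A = np.zeros((n_states, n_actions))
va_learning_update(V, A, x=0, a=1, r=1.0, y=2, done=False)
```

The V + A decomposition above also mirrors the dueling architecture, where a single network outputs a state-value stream and an advantage stream that are summed to form action values.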
