Removing the Target Network from Deep Q-Networks with the Mellowmax Operator

Deep Q-Network (DQN) is a learning algorithm that achieves human-level performance in high-dimensional domains such as Atari games. We propose that replacing the max operator in DQN with Mellowmax, a softmax operator, reduces its need for a separate target network, which is otherwise necessary to stabilize learning. We empirically show that, in the absence of a target network, the combination of DQN and Mellowmax outperforms DQN alone.
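
For concreteness, Mellowmax is the softmax operator mm_omega(x) = (1/omega) * log((1/n) * sum_i exp(omega * x_i)), which recovers the hard max as omega grows large. The NumPy sketch below is our own illustration, not the paper's implementation; names such as `q_next`, `reward`, `gamma`, and `done` are hypothetical. It shows the operator and how it could replace the hard max in a bootstrapped target computed from the online network alone, i.e., without a target network.

```python
import numpy as np
from scipy.special import logsumexp

def mellowmax(q_values: np.ndarray, omega: float = 1.0, axis: int = -1) -> np.ndarray:
    """Mellowmax: mm_omega(x) = (1/omega) * log((1/n) * sum_i exp(omega * x_i)).

    Computed via logsumexp for numerical stability. Approaches the hard
    max as omega -> infinity and the mean as omega -> 0+.
    """
    n = q_values.shape[axis]
    return (logsumexp(omega * q_values, axis=axis) - np.log(n)) / omega

# Illustrative bootstrapped target with no target network: the next-state
# Q-values come from the online network itself (hypothetical names).
q_next = np.array([[1.0, 2.0, 0.5]])   # online network's Q(s', .) per action
reward, gamma, done = 1.0, 0.99, 0.0
target = reward + gamma * (1.0 - done) * mellowmax(q_next, omega=5.0)
```

The higher omega is set, the closer the target is to the standard DQN target that uses a hard max over next-state Q-values.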