Deep Q-Network with Proximal Iteration

We employ Proximal Iteration for value-function optimization in reinforcement learning. Proximal Iteration is a computationally efficient technique that enables us to bias the optimization procedure towards more desirable solutions. As a concrete application of Proximal Iteration in deep reinforcement learning, we endow the objective function of the Deep Q-Network (DQN) agent with a proximal term to ensure that the online-network component of DQN remains in the vicinity of the target network. The resultant agent, which we call DQN with Proximal Iteration, or DQNPro, exhibits significant improvements over the original DQN on the Atari benchmark. Our results accentuate the power of employing sound optimization techniques for deep reinforcement learning.
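To make the idea concrete, here is a minimal sketch of how a DQN loss could be augmented with such a proximal term. It assumes a standard online/target network pair trained on transition batches; the function name `dqn_pro_loss`, the proximal weight `c_prox`, and the discount `gamma` are illustrative placeholders, not the paper's exact formulation or hyperparameters.

```python
import torch
import torch.nn.functional as F

def dqn_pro_loss(online_net, target_net, batch, gamma=0.99, c_prox=0.01):
    """Sketch of a DQN loss with a proximal penalty toward the target network."""
    states, actions, rewards, next_states, dones = batch

    # Standard DQN temporal-difference target, computed with the target network.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        td_target = rewards + gamma * (1.0 - dones) * next_q

    # Q-values of the taken actions under the online network.
    q = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    td_loss = F.smooth_l1_loss(q, td_target)

    # Proximal term: penalize deviation of the online parameters from the
    # target-network parameters, keeping the online network in its vicinity.
    prox = sum(
        ((p - p_t.detach()) ** 2).sum()
        for p, p_t in zip(online_net.parameters(), target_net.parameters())
    )
    return td_loss + c_prox * prox
```

In use, this loss would simply replace the usual DQN loss in the training loop (e.g., minimized with Adam), with the target network still updated periodically as in standard DQN.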
