Deep Q-Network with Proximal Iteration

We employ Proximal Iteration for value-function optimization in reinforcement learning. Proximal Iteration is a computationally efficient technique that enables us to bias the optimization procedure towards more desirable solutions. As a concrete application of Proximal Iteration in deep reinforcement learning, we endow the objective function of the Deep Q-Network (DQN) agent with a proximal term to ensure that the online-network component of DQN remains in the vicinity of the target network. The resultant agent, which we call DQN with Proximal Iteration, or DQNPro, exhibits significant improvements over the original DQN on the Atari benchmark. Our results accentuate the power of employing sound optimization techniques for deep reinforcement learning.
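To make the idea concrete, here is a minimal sketch of how a DQN loss could be augmented with such a proximal term. It assumes a standard online/target network pair trained on transition batches; the function name `dqn_pro_loss`, the proximal weight `c_prox`, and the discount `gamma` are illustrative placeholders, not the paper's exact formulation or hyperparameters.

```python
import torch
import torch.nn.functional as F

def dqn_pro_loss(online_net, target_net, batch, gamma=0.99, c_prox=0.01):
    """Sketch of a DQN loss with a proximal penalty toward the target network."""
    states, actions, rewards, next_states, dones = batch

    # Standard DQN temporal-difference target, computed with the target network.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        td_target = rewards + gamma * (1.0 - dones) * next_q

    # Q-values of the taken actions under the online network.
    q = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    td_loss = F.smooth_l1_loss(q, td_target)

    # Proximal term: penalize deviation of the online parameters from the
    # target-network parameters, keeping the online network in its vicinity.
    prox = sum(
        ((p - p_t.detach()) ** 2).sum()
        for p, p_t in zip(online_net.parameters(), target_net.parameters())
    )
    return td_loss + c_prox * prox
```

In use, this loss would simply replace the usual DQN loss in the training loop (e.g., minimized with Adam), with the target network still updated periodically as in standard DQN.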
