Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning

We identify an implicit under-parameterization phenomenon in value-based deep RL methods that use bootstrapping: when value functions, approximated using deep neural networks, are trained with gradient descent using iterated regression onto target values generated by previous instances of the value network, more gradient updates decrease the expressivity of the current value network. We characterize this loss of expressivity in terms of a drop in the rank of the learned value network features, and show that this corresponds to a drop in performance. We demonstrate this phenomenon on widely studied domains, including the Atari and Gym benchmarks, in both offline and online RL settings. We formally analyze this phenomenon and show that it results from a pathological interaction between bootstrapping and gradient-based optimization. We further show that mitigating implicit under-parameterization by controlling rank collapse improves performance.
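
To make the feature-rank measurement concrete, below is a minimal illustrative sketch (under our own assumptions, not the paper's reference implementation) of one way to estimate the effective rank of a value network's penultimate-layer feature matrix from its singular values; the function name srank, the threshold parameter delta, and the default cutoff of 0.01 are hypothetical choices for illustration. Tracking such a quantity over the course of training is one way to surface the rank collapse described above.

    # Illustrative sketch: effective rank of a feature matrix via SVD.
    # Assumptions: features is a (num_states, feature_dim) matrix of
    # penultimate-layer activations; delta is a mass-cutoff threshold.
    import numpy as np

    def srank(features: np.ndarray, delta: float = 0.01) -> int:
        """Smallest k such that the top-k singular values account for a
        (1 - delta) fraction of the total singular-value mass."""
        singular_values = np.linalg.svd(features, compute_uv=False)
        cumulative = np.cumsum(singular_values) / np.sum(singular_values)
        return int(np.searchsorted(cumulative, 1.0 - delta) + 1)

    # Usage: periodically evaluate srank on features of states sampled from
    # the replay buffer; a steady decline signals a loss of expressivity.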
