Going Beyond Linear RL: Sample Efficient Neural Function Approximation

Deep Reinforcement Learning (RL) powered by neural network approximation of the Q-function has had enormous empirical success. While the theory of RL has traditionally focused on linear (or eluder dimension) function approximation, little is known about nonlinear RL with neural network approximation of the Q-function. This is the focus of this work, where we study function approximation with two-layer neural networks (considering both ReLU and polynomial activation functions). Our first result is a computationally and statistically efficient algorithm in the generative model setting, under a completeness assumption for two-layer neural networks. Our second result considers the same setting but assumes only realizability of the neural network function class. Here, assuming deterministic dynamics, the sample complexity scales linearly in the algebraic dimension. In both cases, our results significantly improve upon what can be attained with linear (or eluder dimension) methods.
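
To make the function class concrete, a standard parameterization of a two-layer Q-function over a state-action feature map is sketched below. The symbols m, b_i, w_i, and phi are illustrative and not fixed by the abstract; this is the usual two-layer form under those assumptions, and the exact normalization used in the paper may differ.

    Q_\theta(s, a) = \sum_{i=1}^{m} b_i \, \sigma\!\left(w_i^\top \phi(s, a)\right),
    \qquad \sigma(z) = \max(z, 0) \ \text{(ReLU)} \quad \text{or} \quad \sigma(z) = z^p \ \text{(degree-}p\text{ polynomial)}.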
