Bridging Exploration and General Function Approximation in Reinforcement Learning: Provably Efficient Kernel and Neural Value Iterations

Reinforcement learning (RL) algorithms combined with modern function approximators such as kernel functions and deep neural networks have achieved significant empirical success in large-scale applications with massive state spaces. From a theoretical perspective, however, RL with function approximation poses a fundamental challenge for developing algorithms with provable computational and statistical efficiency, because such algorithms must account for both the exploration-exploitation tradeoff that is inherent in RL and the bias-variance tradeoff that is innate to statistical estimation. To address this challenge, focusing on the episodic setting where the action-value functions are represented by a kernel function or an over-parameterized neural network, we propose the first provably efficient RL algorithm with both polynomial runtime and polynomial sample complexity in this setting, without additional assumptions on the data-generating model. In particular, for both the kernel and neural settings, we prove that an optimistic modification of the least-squares value iteration algorithm incurs an $\tilde{\mathcal{O}}(\delta_{\mathcal{F}} H^2 \sqrt{T})$ regret, where $\delta_{\mathcal{F}}$ characterizes the intrinsic complexity of the function class $\mathcal{F}$, $H$ is the length of each episode, and $T$ is the total number of episodes. Our regret bounds are independent of the number of states and therefore remain valid even when the state space is infinite, which exhibits the benefit of function approximation.
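To make the algorithmic idea concrete, below is a minimal NumPy sketch of the optimistic kernel ridge-regression update that underlies the kernel value iteration described above: the action-value estimate at a step is the kernel ridge fit of the one-step targets, plus an exploration bonus proportional to the predictive standard deviation, truncated to the value range. The Gaussian RBF kernel, the parameters lam and beta, and the function names are illustrative assumptions for this sketch, not the paper's exact construction.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    """Gaussian RBF kernel between rows of X and rows of Y (an assumed choice of kernel)."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / (2.0 * bandwidth**2))

def optimistic_q(za, Z, y, lam=1.0, beta=1.0, H=10):
    """Optimistic action-value estimates at query state-action features `za`.

    Z : (n, d) past state-action features at this step
    y : (n,) regression targets r + max_a Q_{h+1}(s', a) from the next step
    Returns the kernel ridge-regression prediction plus a UCB-style bonus,
    clipped to [0, H].  Hyperparameters lam and beta are placeholders.
    """
    n = Z.shape[0]
    K = rbf_kernel(Z, Z)                                  # Gram matrix of past data
    A = np.linalg.solve(K + lam * np.eye(n), np.eye(n))   # (K + lam I)^{-1}
    k = rbf_kernel(za, Z)                                 # cross-kernel with queries
    mean = k @ (A @ y)                                    # ridge fit of the targets
    var = rbf_kernel(za, za).diagonal() - np.sum((k @ A) * k, axis=1)
    bonus = beta * np.sqrt(np.maximum(var, 0.0))          # predictive-std exploration bonus
    return np.clip(mean + bonus, 0.0, H)                  # optimism, truncated to the value range
```

In a backward pass over steps h = H, ..., 1, one would call a routine like this at each step, forming the targets y from the rewards and the optimistic estimate already computed for step h + 1; the analogous neural variant replaces the kernel ridge fit with a fit of an over-parameterized network and a bonus built from its neural tangent features.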
