Provably Efficient Reinforcement Learning with Linear Function Approximation

Modern Reinforcement Learning (RL) is commonly applied to practical problems with an enormous number of states, where function approximation must be deployed to approximate either the value function or the policy. The introduction of function approximation raises a fundamental set of challenges involving computational and statistical efficiency, especially given the need to manage the exploration/exploitation tradeoff. As a result, a core RL question remains open: how can we design provably efficient RL algorithms that incorporate function approximation? This question persists even in a basic setting with linear dynamics and linear rewards, for which only linear function approximation is needed. This paper presents the first provably efficient RL algorithm with both polynomial runtime and polynomial sample complexity in this linear setting, without requiring a "simulator" or additional assumptions. Concretely, we prove that an optimistic modification of Least-Squares Value Iteration (LSVI)---a classical algorithm frequently studied in the linear setting---achieves $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret, where $d$ is the ambient dimension of the feature space, $H$ is the length of each episode, and $T$ is the total number of steps. Importantly, this regret bound is independent of the number of states and actions.
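
To make the optimistic LSVI update concrete, below is a minimal sketch of the episodic loop, assuming a known feature map `phi(s, a)`, a finite action set, and a generic environment interface (`env.reset`/`env.step`). These names, along with `beta` and `lam`, are illustrative placeholders rather than the paper's notation; in particular, the constant bonus coefficient `beta` stands in for the theoretical choice of order $dH\sqrt{\log(dT/\delta)}$ used in the regret analysis.

```python
import numpy as np

def lsvi_ucb(env, phi, d, H, K, actions, beta=1.0, lam=1.0):
    """Minimal sketch of optimistic Least-Squares Value Iteration (LSVI-UCB).

    Assumed (hypothetical) interfaces, not taken from the paper:
      env.reset() -> initial state; env.step(a) -> (next_state, reward, done)
      phi(s, a)   -> d-dimensional feature vector for the state-action pair
    """
    data = [[] for _ in range(H)]   # transitions (s, a, r, s_next), grouped by step h
    episode_rewards = []

    for k in range(K):
        # --- Backward pass: fit optimistic Q-functions from all past data. ---
        w = [np.zeros(d) for _ in range(H)]
        Lambda_inv = [np.eye(d) / lam for _ in range(H)]

        def Q(h, s, a):
            if h >= H:
                return 0.0
            f = phi(s, a)
            bonus = beta * np.sqrt(f @ Lambda_inv[h] @ f)  # UCB-style exploration bonus
            return min(w[h] @ f + bonus, H)                # truncate at the horizon

        for h in reversed(range(H)):
            Lambda = lam * np.eye(d)
            target_sum = np.zeros(d)
            for (s, a, r, s_next) in data[h]:
                f = phi(s, a)
                Lambda += np.outer(f, f)
                # Regression target: reward plus the optimistic value at step h+1.
                v_next = max(Q(h + 1, s_next, b) for b in actions)
                target_sum += f * (r + v_next)
            Lambda_inv[h] = np.linalg.inv(Lambda)
            w[h] = Lambda_inv[h] @ target_sum   # ridge-regression weights

        # --- Forward pass: act greedily w.r.t. the optimistic Q-functions. ---
        s = env.reset()
        ep_reward = 0.0
        for h in range(H):
            a = max(actions, key=lambda b: Q(h, s, b))
            s_next, r, _ = env.step(a)
            data[h].append((s, a, r, s_next))
            ep_reward += r
            s = s_next
        episode_rewards.append(ep_reward)

    return episode_rewards
```

The exploration bonus $\beta\sqrt{\phi^\top \Lambda_h^{-1}\phi}$ is what makes the value estimates optimistic: it is large in feature directions that past data has rarely covered, so the greedy policy is drawn toward under-explored regions, which is the mechanism behind the $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret bound. For readability the sketch rebuilds $\Lambda_h$ from scratch each episode; an incremental Sherman-Morrison update would reduce the per-episode cost.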
