Provably Efficient Reinforcement Learning with Linear Function Approximation Under Adaptivity Constraints

We study reinforcement learning (RL) with linear function approximation under adaptivity constraints. We consider two popular limited-adaptivity models, the batch learning model and the rare policy switch model, and propose two efficient online RL algorithms for episodic linear Markov decision processes, where both the transition probability and the reward function can be represented as linear functions of a known feature mapping. Specifically, for the batch learning model, our proposed LSVI-UCB-Batch algorithm achieves an $\tilde O(\sqrt{d^3H^3T} + dHT/B)$ regret, where $d$ is the dimension of the feature mapping, $H$ is the episode length, $T$ is the number of interactions, and $B$ is the number of batches. Our result suggests that it suffices to use only $\sqrt{T/(dH)}$ batches to obtain the $\tilde O(\sqrt{d^3H^3T})$ regret. For the rare policy switch model, our proposed LSVI-UCB-RareSwitch algorithm enjoys an $\tilde O(\sqrt{d^3H^3T[1+T/(dH)]^{dH/B}})$ regret, which implies that $dH\log T$ policy switches suffice to obtain the $\tilde O(\sqrt{d^3H^3T})$ regret. Both algorithms achieve the same regret as the LSVI-UCB algorithm (Jin et al., 2019) with substantially less adaptivity. We also establish a lower bound for the batch learning model, which suggests that the dependence on $B$ in our regret bound is tight.
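
As a quick check of the first claim, choosing $B = \sqrt{T/(dH)}$ balances the two terms in the LSVI-UCB-Batch bound:
$$\frac{dHT}{B} = dHT\sqrt{\frac{dH}{T}} = \sqrt{d^3H^3T},$$
so the total regret remains $\tilde O(\sqrt{d^3H^3T})$. Likewise, taking $B = dH\log T$ policy switches in the LSVI-UCB-RareSwitch bound gives
$$\Big[1+\frac{T}{dH}\Big]^{dH/B} = \Big[1+\frac{T}{dH}\Big]^{1/\log T} = \exp\!\Big(\frac{\log(1+T/(dH))}{\log T}\Big) = O(1),$$
which again recovers the $\tilde O(\sqrt{d^3H^3T})$ regret.

To make the rare policy switch model concrete, the sketch below illustrates the standard determinant-based switching criterion used by low-switching algorithms for linear bandits and linear MDPs: the policy is recomputed only when the regularized Gram matrix of observed features grows by a constant factor. This is a minimal illustration under that assumption, not the paper's pseudocode; the helper name should_switch, the threshold eta, and the simulated features are hypothetical.

    import numpy as np

    def should_switch(gram_cur, gram_last, eta=2.0):
        # Switch when det(gram_cur) > eta * det(gram_last);
        # slogdet is used instead of det for numerical stability.
        _, logdet_cur = np.linalg.slogdet(gram_cur)
        _, logdet_last = np.linalg.slogdet(gram_last)
        return logdet_cur > np.log(eta) + logdet_last

    d = 4                                   # feature dimension (toy value)
    gram_last = np.eye(d)                   # regularized Gram matrix at the last policy switch
    gram_cur = gram_last.copy()
    rng = np.random.default_rng(0)
    num_switches = 0
    for t in range(10000):                  # T interactions
        phi = rng.normal(size=d)            # feature vector of the visited state-action pair
        gram_cur += np.outer(phi, phi)      # rank-one Gram matrix update
        if should_switch(gram_cur, gram_last):
            gram_last = gram_cur.copy()     # re-solve least-squares value iteration here
            num_switches += 1
    print(num_switches)                     # grows as O(d log T)

Since $\log\det$ of the Gram matrix can increase by at most $O(d\log T)$ over $T$ bounded-feature interactions, a constant-factor trigger of this kind fires only $O(d\log T)$ times; maintaining one Gram matrix per step $h \in [H]$ accounts for the additional $H$ factor in the $dH\log T$ switching budget above.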

[1]  Shachar Lovett,et al.  Bilinear Classes: A Structural Framework for Provable Generalization in RL , 2021, ICML.

[2]  Lin F. Yang,et al.  A Provably Efficient Algorithm for Linear Markov Decision Process with Low Switching Cost , 2021, ArXiv.

[3]  Quanquan Gu,et al.  Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes , 2020, COLT.

[4]  Quanquan Gu,et al.  Logarithmic Regret for Reinforcement Learning with Linear Function Approximation , 2020, ICML.

[5]  Michael I. Jordan,et al.  Bridging Exploration and General Function Approximation in Reinforcement Learning: Provably Efficient Kernel and Neural Value Iterations , 2020, ArXiv.

[6]  David Simchi-Levi,et al.  Instance-Dependent Complexity of Contextual Bandits and Reinforcement Learning: A Disagreement-Based Perspective , 2020, COLT.

[7]  S. Du,et al.  Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal Algorithm Escaping the Curse of Horizon , 2020, COLT.

[8]  Yuan Zhou,et al.  Linear bandits with limited adaptivity and learning distributional optimal design , 2020, STOC.

[9]  Gergely Neu,et al.  A Unifying View of Optimism in Episodic Reinforcement Learning , 2020, NeurIPS.

[10]  Quanquan Gu,et al.  Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping , 2020, ICML.

[11]  Mengdi Wang,et al.  Model-Based Reinforcement Learning with Value-Targeted Regression , 2020, L4DC.

[12]  Ruosong Wang,et al.  Provably Efficient Reinforcement Learning with General Value Function Approximation , 2020, ArXiv.

[13]  Lin F. Yang,et al.  Reinforcement Learning with General Value Function Approximation: Provably Efficient Approach via Bounded Eluder Dimension , 2020, NeurIPS.

[14]  Yanjun Han,et al.  Sequential Batch Learning in Finite-Action Linear Contextual Bandits , 2020, ArXiv.

[15]  Mykel J. Kochenderfer,et al.  Learning Near Optimal Policies with Low Inherent Bellman Error , 2020, ICML.

[16]  Chi Jin,et al.  Provably Efficient Exploration in Policy Optimization , 2019, ICML.

[17]  Ruosong Wang,et al.  Optimism in Reinforcement Learning with Generalized Linear Function Approximation , 2019, ICLR.

[18]  Csaba Szepesvári,et al.  Learning with Good Feature Representations in Bandits and in RL with a Generative Model , 2019, ICML.

[19]  Amin Karbasi,et al.  Minimax Regret of Switching-Constrained Online Convex Optimization: No Phase Transition , 2019, NeurIPS.

[20]  Amin Karbasi,et al.  Regret Bounds for Batched Bandits , 2019, AAAI.

[21]  Lin F. Yang,et al.  Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning? , 2019, ICLR.

[22]  Ruosong Wang,et al.  Provably Efficient Q-learning with Function Approximation via Distribution Shift Error Checking Oracle , 2019, NeurIPS.

[23]  Martin J. Wainwright,et al.  Variance-reduced Q-learning is minimax optimal , 2019, ArXiv.

[24]  Yu Bai,et al.  Provably Efficient Q-Learning with Low Switching Cost , 2019, NeurIPS.

[25]  Mengdi Wang,et al.  Reinforcement Learning in Feature Space: Matrix Bandit, Kernels, and Regret Bound , 2019, ICML.

[26]  Yishay Mansour,et al.  Learning Linear-Quadratic Regulators Efficiently with only $\sqrt{T}$ Regret , 2019, ICML.

[27]  David B. Dunson,et al.  Lipschitz Bandit Optimization with Improved Efficiency , 2019, ArXiv.

[28]  Silvio Lattanzi,et al.  Consistent Online Optimization: Convex and Submodular , 2019, AISTATS.

[29]  Yanjun Han,et al.  Batched Multi-armed Bandits Problem , 2019, NeurIPS.

[31]  Mengdi Wang,et al.  Sample-Optimal Parametric Q-Learning Using Linearly Additive Features , 2019, ICML.

[32]  C. Rudin,et al.  Towards Practical Lipschitz Bandits , 2019, FODS.

[33]  Alessandro Lazaric,et al.  Improved Regret Bounds for Thompson Sampling in Linear Quadratic Control Problems , 2018, ICML.

[34]  Nikolai Matni,et al.  Regret Bounds for Robust Adaptive Control of the Linear Quadratic Regulator , 2018, NeurIPS.

[35]  Nevena Lazic,et al.  Model-Free Linear Quadratic Control via Reduction to Expert Prediction , 2018, AISTATS.

[36]  Kunal Talwar,et al.  Online learning over a finite action set with limited switching , 2018, COLT.

[37]  Kamyar Azizzadenesheli,et al.  Efficient Exploration Through Bayesian Deep Q-Networks , 2018, Information Theory and Applications Workshop (ITA).

[38]  Xian Wu,et al.  Variance reduced value iteration and faster algorithms for solving Markov decision processes , 2017, SODA.

[39]  Rémi Munos,et al.  Minimax Regret Bounds for Reinforcement Learning , 2017, ICML.

[40]  Tor Lattimore,et al.  Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning , 2017, NIPS.

[41]  Nan Jiang,et al.  Contextual Decision Processes with low Bellman rank are PAC-Learnable , 2016, ICML.

[42]  Benjamin Van Roy,et al.  On Lower Bounds for Regret in Reinforcement Learning , 2016, ArXiv.

[43]  Jianfeng Gao,et al.  Deep Reinforcement Learning for Dialogue Generation , 2016, EMNLP.

[44]  Alex Graves,et al.  Asynchronous Methods for Deep Reinforcement Learning , 2016, ICML.

[45]  Demis Hassabis,et al.  Mastering the game of Go with deep neural networks and tree search , 2016, Nature.

[46]  Luc Devroye,et al.  Random-Walk Perturbations for Online Combinatorial Optimization , 2015, IEEE Transactions on Information Theory.

[47]  Vianney Perchet,et al.  Batched Bandit Problems , 2015, COLT.

[48]  Michael I. Jordan,et al.  Trust Region Policy Optimization , 2015, ICML.

[49]  Benjamin Van Roy,et al.  Generalization and Exploration via Randomized Value Functions , 2014, ICML.

[50]  Alex Graves,et al.  Playing Atari with Deep Reinforcement Learning , 2013, ArXiv.

[51]  Yuval Peres,et al.  Bandits with switching costs: $T^{2/3}$ regret , 2013, STOC.

[52]  Jan Peters,et al.  Reinforcement learning in robotics: A survey , 2013, Int. J. Robotics Res.

[53]  Zheng Wen,et al.  Efficient Reinforcement Learning in Deterministic Systems with Value Function Generalization , 2013, Math. Oper. Res.

[54]  Zheng Wen,et al.  Efficient Exploration and Value Function Generalization in Deterministic Systems , 2013, NIPS.

[55]  Nicolò Cesa-Bianchi,et al.  Online Learning with Switching Costs and Other Adaptive Adversaries , 2013, NIPS.

[56]  Hilbert J. Kappen,et al.  On the Sample Complexity of Reinforcement Learning with a Generative Model , 2012, ICML.

[57]  Ambuj Tewari,et al.  Online Bandit Learning against an Adaptive Adversary: from Regret to Policy Regret , 2012, ICML.

[58]  Sébastien Bubeck,et al.  Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems , 2012, Found. Trends Mach. Learn.

[59]  Tor Lattimore,et al.  PAC Bounds for Discounted MDPs , 2012, ALT.

[60]  Csaba Szepesvári,et al.  Regret Bounds for the Adaptive Control of Linear Quadratic Systems , 2011, COLT.

[61]  Csaba Szepesvári,et al.  Improved Algorithms for Linear Stochastic Bandits , 2011, NIPS.

[62]  Hilbert J. Kappen,et al.  Speedy Q-Learning , 2011, NIPS.

[63]  Wei Chu,et al.  Contextual Bandits with Linear Payoff Functions , 2011, AISTATS.

[64]  Roman Vershynin,et al.  Introduction to the non-asymptotic analysis of random matrices , 2010, Compressed Sensing.

[65]  Csaba Szepesvári,et al.  Algorithms for Reinforcement Learning , 2010, Synthesis Lectures on Artificial Intelligence and Machine Learning.

[66]  Wei Chu,et al.  A contextual-bandit approach to personalized news article recommendation , 2010, WWW '10.

[67]  John N. Tsitsiklis,et al.  Linearly Parameterized Bandits , 2008, Math. Oper. Res.

[68]  Peter Auer,et al.  Near-optimal Regret Bounds for Reinforcement Learning , 2008, J. Mach. Learn. Res.

[69]  Francisco S. Melo,et al.  Q-Learning with Linear Function Approximation , 2007, COLT.

[70]  Lihong Li,et al.  PAC model-free reinforcement learning , 2006, ICML.

[71]  Santosh S. Vempala,et al.  Efficient algorithms for online decision problems , 2005, J. Comput. Syst. Sci.

[72]  Peter Auer,et al.  Using Confidence Bounds for Exploitation-Exploration Trade-offs , 2003, J. Mach. Learn. Res.

[73]  John N. Tsitsiklis,et al.  Analysis of temporal-difference learning with function approximation , 1996, NIPS.

[74]  Leemon C. Baird,et al.  Residual Algorithms: Reinforcement Learning with Function Approximation , 1995, ICML.

[75]  Martin L. Puterman,et al.  Markov Decision Processes: Discrete Stochastic Dynamic Programming , 1994 .

[76]  Reid G. Simmons,et al.  Complexity Analysis of Real-Time Reinforcement Learning , 1993, AAAI.

[77]  Richard S. Sutton,et al.  Learning to predict by the methods of temporal differences , 1988, Machine Learning.

[78]  Michael I. Jordan,et al.  On Function Approximation in Reinforcement Learning: Optimism in the Face of Large State Spaces , 2021 .

[79]  Michael I. Jordan,et al.  Provably Efficient Reinforcement Learning with Linear Function Approximation , 2019, COLT.

[81]  Berthold Vöcking,et al.  Regret Minimization for Online Buffering Problems Using the Weighted Majority Algorithm , 2010, Electron. Colloquium Comput. Complex.

[82]  Thomas P. Hayes,et al.  Stochastic Linear Optimization under Bandit Feedback , 2008, COLT.

[83]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[84]  Steven J. Bradtke,et al.  Linear Least-Squares algorithms for temporal difference learning , 1996, Machine Learning.

[85]  Andrew W. Moore,et al.  Generalization in Reinforcement Learning: Safely Approximating the Value Function , 1994, NIPS.