Settling the Sample Complexity of Online Reinforcement Learning

A central issue lying at the heart of online reinforcement learning (RL) is data efficiency. While a number of recent works achieved asymptotically minimal regret in online RL, the optimality of these results is only guaranteed in a ``large-sample'' regime, imposing an enormous burn-in cost in order for their algorithms to operate optimally. How to achieve minimax-optimal regret without incurring any burn-in cost has been an open problem in RL theory. We settle this problem in the context of finite-horizon inhomogeneous Markov decision processes. Specifically, we prove that a modified version of Monotonic Value Propagation (MVP), a model-based algorithm proposed by \cite{zhang2020reinforcement}, achieves a regret on the order of (modulo log factors) \begin{equation*} \min\big\{ \sqrt{SAH^3K}, \,HK \big\}, \end{equation*} where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, and $K$ is the total number of episodes. This regret matches the minimax lower bound for the entire range of sample sizes $K\geq 1$, essentially eliminating any burn-in requirement. It also translates to a PAC sample complexity (i.e., the number of episodes needed to yield $\varepsilon$-accuracy) of $\frac{SAH^3}{\varepsilon^2}$ up to log factors, which is minimax-optimal for the full $\varepsilon$-range. Further, we extend our theory to unveil the influence of problem-dependent quantities such as the optimal value/cost and certain variances. The key technical innovation lies in the development of a new regret decomposition strategy and a novel analysis paradigm to decouple complicated statistical dependencies -- a long-standing challenge facing the analysis of online RL in the sample-hungry regime.
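For concreteness, the regret-to-PAC translation follows the standard online-to-batch argument sketched below (log factors and constants suppressed; the initial state $s_1$ is taken to be fixed, and the value notation $V_1^{\pi}$ together with the uniform-mixture output policy $\widehat{\pi}$ are introduced here purely for illustration and need not match the paper's exact conversion). If the executed policies $\pi^1,\dots,\pi^K$ satisfy
\begin{equation*}
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K}\Big(V_1^{\star}(s_1)-V_1^{\pi^k}(s_1)\Big) \;\lesssim\; \sqrt{SAH^3K},
\end{equation*}
then a policy $\widehat{\pi}$ drawn uniformly at random from $\{\pi^1,\dots,\pi^K\}$ obeys
\begin{equation*}
\mathbb{E}\big[V_1^{\star}(s_1)-V_1^{\widehat{\pi}}(s_1)\big] \;=\; \frac{1}{K}\,\mathrm{Regret}(K) \;\lesssim\; \sqrt{\frac{SAH^3}{K}},
\end{equation*}
so demanding that the right-hand side be at most $\varepsilon$ requires $K \gtrsim \frac{SAH^3}{\varepsilon^2}$ episodes, which is the PAC sample complexity quoted above.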

[1]  Yuxin Chen,et al.  The Curious Price of Distributional Robustness in Reinforcement Learning with a Generative Model , 2023, ArXiv.

[2]  Wen Sun,et al.  The Benefits of Being Distributional: Small-Loss Bounds for Reinforcement Learning , 2023, ArXiv.

[3]  Gen Li,et al.  Regret-Optimal Model-Free Reinforcement Learning for Discounted MDPs with Short Burn-In Time , 2023, ArXiv.

[4]  Yuxin Chen,et al.  Minimax-Optimal Reward-Agnostic Exploration in Reinforcement Learning , 2023, ArXiv.

[5]  Quanquan Gu,et al.  Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency , 2023, COLT.

[6]  S. Du,et al.  Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both Worlds in Stochastic and Deterministic Environments , 2023, ICML.

[7]  Yuxin Chen,et al.  Minimax-Optimal Multi-Agent RL in Markov Games With a Generative Model , 2022, NeurIPS.

[8]  S. Du,et al.  On Gap-dependent Bounds for Offline Reinforcement Learning , 2022, NeurIPS.

[9]  Yuxin Chen,et al.  Settling the Sample Complexity of Model-Based Offline Reinforcement Learning , 2022, ArXiv.

[10]  S. Du,et al.  Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies , 2022, COLT.

[11]  Jianqing Fan,et al.  The Efficacy of Pessimism in Asynchronous Q-Learning , 2022, IEEE Transactions on Information Theory.

[12]  Yu-Xiang Wang,et al.  Near-optimal Offline Reinforcement Learning with Linear Representation: Leveraging Variance Information with Pessimism , 2022, ICLR.

[13]  Yuxin Chen,et al.  Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity , 2022, ICML.

[14]  Kevin G. Jamieson,et al.  First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach , 2021, ICML.

[15]  Yuanzhi Li,et al.  Settling the Horizon-Dependence of Sample Complexity in Reinforcement Learning , 2021, 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS).

[16]  Yuxin Chen,et al.  Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning , 2021, NeurIPS.

[17]  Julian Zimmert,et al.  Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforcement Learning , 2021, NeurIPS.

[18]  Alessandro Lazaric,et al.  A Fully Problem-Dependent Regret Lower Bound for Finite-Horizon MDPs , 2021, ArXiv.

[19]  Haipeng Luo,et al.  Implicit Finite-Horizon Approximation and Efficient Optimal Algorithms for Stochastic Shortest Path , 2021, NeurIPS.

[20]  Caiming Xiong,et al.  Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning , 2021, NeurIPS.

[21]  Alessandro Lazaric,et al.  Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret , 2021, NeurIPS.

[22]  S. Du,et al.  Nearly Horizon-Free Offline Reinforcement Learning , 2021, NeurIPS.

[23]  Stuart J. Russell,et al.  Bridging Offline Reinforcement Learning and Imitation Learning: A Tale of Pessimism , 2021, IEEE Transactions on Information Theory.

[24]  Michal Valko,et al.  UCB Momentum Q-learning: Correcting the bias without forgetting , 2021, ICML.

[25]  Simon S. Du,et al.  Near-Optimal Randomized Exploration for Tabular Markov Decision Processes , 2021, NeurIPS.

[26]  Gen Li,et al.  Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis , 2021, Operations Research.

[27]  Tengyu Ma,et al.  Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap , 2021, COLT.

[28]  Martin J. Wainwright,et al.  Instance-Dependent ℓ∞-Bounds for Policy Evaluation in Tabular Reinforcement Learning , 2021, IEEE Transactions on Information Theory.

[29]  Zhuoran Yang,et al.  Is Pessimism Provably Efficient for Offline RL? , 2020, ICML.

[30]  Lin F. Yang,et al.  Minimax Sample Complexity for Turn-based Stochastic Game , 2020, UAI.

[31]  Michal Valko,et al.  Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited , 2020, ALT.

[32]  S. Du,et al.  Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal Algorithm Escaping the Curse of Horizon , 2020, COLT.

[33]  Gergely Neu,et al.  A Unifying View of Optimism in Episodic Reinforcement Learning , 2020, NeurIPS.

[34]  Lin F. Yang,et al.  Q-learning with Logarithmic Regret , 2020, AISTATS.

[35]  Haipeng Luo,et al.  Bias no more: high-probability data-dependent regret bounds for adversarial bandits and MDPs , 2020, NeurIPS.

[36]  Yuxin Chen,et al.  Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction , 2020, IEEE Transactions on Information Theory.

[37]  Yuxin Chen,et al.  Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model , 2020, NeurIPS.

[38]  S. Levine,et al.  Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems , 2020, ArXiv.

[39]  Lin F. Yang,et al.  Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning? , 2020, ArXiv.

[40]  Xiangyang Ji,et al.  Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition , 2020, NeurIPS.

[41]  Akshay Krishnamurthy,et al.  Reward-Free Exploration for Reinforcement Learning , 2020, ICML.

[42]  Siva Theja Maguluri,et al.  Finite-Sample Analysis of Contractive Stochastic Approximation Using Smooth Convex Envelopes , 2020, NeurIPS.

[43]  Adam Wierman,et al.  Finite-Time Analysis of Asynchronous Stochastic Approximation and Q-Learning , 2020, COLT.

[44]  Chi Jin,et al.  Provably Efficient Exploration in Policy Optimization , 2019, ICML.

[45]  Martin J. Wainwright,et al.  Variance-reduced Q-learning is minimax optimal , 2019, ArXiv.

[46]  Lin F. Yang,et al.  Model-Based Reinforcement Learning with a Generative Model is Minimax Optimal , 2019, COLT.

[47]  Daniel Russo,et al.  Worst-Case Regret Bounds for Exploration via Randomized Value Functions , 2019, NeurIPS.

[48]  Yu Bai,et al.  Provably Efficient Q-Learning with Low Switching Cost , 2019, NeurIPS.

[49]  Martin J. Wainwright,et al.  Stochastic approximation with cone-contractive operators: Sharp $\ell_\infty$-bounds for $Q$-learning , 2019, ArXiv.

[50]  Shie Mannor,et al.  Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies , 2019, NeurIPS.

[51]  Max Simchowitz,et al.  Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs , 2019, NeurIPS.

[52]  Xiaoyu Chen,et al.  Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP , 2019, ICLR.

[53]  Emma Brunskill,et al.  Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds , 2019, ICML.

[54]  Lihong Li,et al.  Policy Certificates: Towards Accountable Reinforcement Learning , 2018, ICML.

[55]  Michael I. Jordan,et al.  Is Q-learning Provably Efficient? , 2018, NeurIPS.

[56]  Nan Jiang,et al.  Open Problem: The Dependence of Sample Complexity Lower Bounds on Planning Horizon , 2018, COLT.

[57]  Mohammad Sadegh Talebi,et al.  Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs , 2018, ALT.

[58]  Alessandro Lazaric,et al.  Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning , 2018, ICML.

[59]  Yuanzhi Li,et al.  Make the Minority Great Again: First-Order Regret Bound for Contextual Bandits , 2018, ICML.

[60]  Xian Wu,et al.  Variance reduced value iteration and faster algorithms for solving Markov decision processes , 2017, SODA.

[61]  John Langford,et al.  Open Problem: First-Order Regret Bounds for Contextual Bandits , 2017, COLT.

[62]  Shipra Agrawal,et al.  Optimistic posterior sampling for reinforcement learning: worst-case regret bounds , 2017, NIPS.

[63]  Rémi Munos,et al.  Minimax Regret Bounds for Reinforcement Learning , 2017, ICML.

[64]  Tor Lattimore,et al.  Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning , 2017, NIPS.

[65]  Christoph Dann,et al.  Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning , 2015, NIPS.

[66]  Benjamin Van Roy,et al.  (More) Efficient Reinforcement Learning via Posterior Sampling , 2013, NIPS.

[67]  R. Srikant,et al.  Error bounds for constant step-size Q-learning , 2012, Syst. Control. Lett..

[68]  Hilbert J. Kappen,et al.  On the Sample Complexity of Reinforcement Learning with a Generative Model , 2012, ICML.

[69]  Tor Lattimore,et al.  PAC Bounds for Discounted MDPs , 2012, ALT.

[70]  Csaba Szepesvári,et al.  Model-based reinforcement learning with nearly tight exploration complexity bounds , 2010, ICML.

[71]  Massimiliano Pontil,et al.  Empirical Bernstein Bounds and Sample-Variance Penalization , 2009, COLT.

[72]  Ambuj Tewari,et al.  REGAL: A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs , 2009, UAI.

[73]  Andrew Y. Ng,et al.  Near-Bayesian exploration in polynomial time , 2009, ICML.

[74]  Peter Auer,et al.  Near-optimal Regret Bounds for Reinforcement Learning , 2008, J. Mach. Learn. Res..

[75]  Michael L. Littman,et al.  An analysis of model-based Interval Estimation for Markov Decision Processes , 2008, J. Comput. Syst. Sci..

[76]  Lihong Li,et al.  PAC model-free reinforcement learning , 2006, ICML.

[77]  Yishay Mansour,et al.  Learning Rates for Q-learning , 2004, J. Mach. Learn. Res..

[78]  Ronen I. Brafman,et al.  R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning , 2001, J. Mach. Learn. Res..

[79]  Michael Kearns,et al.  Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms , 1998, NIPS.

[80]  Michael Kearns,et al.  Near-Optimal Reinforcement Learning in Polynomial Time , 2002, Machine Learning.

[82]  D. Bertsekas.  Reinforcement Learning and Optimal Control: A Selective Overview , 2018.

[83]  Xian Wu,et al.  Near-Optimal Time and Sample Complexities for Solving Markov Decision Processes with a Generative Model , 2018, NeurIPS.

[84]  Sham M. Kakade.  On the sample complexity of reinforcement learning , 2003, PhD thesis.