Settling the Sample Complexity of Online Reinforcement Learning

A central issue lying at the heart of online reinforcement learning (RL) is data efficiency. While a number of recent works achieved asymptotically minimal regret in online RL, the optimality of these results is only guaranteed in a ``large-sample'' regime, imposing an enormous burn-in cost in order for their algorithms to operate optimally. How to achieve minimax-optimal regret without incurring any burn-in cost has been an open problem in RL theory. We settle this problem in the context of finite-horizon inhomogeneous Markov decision processes. Specifically, we prove that a modified version of Monotonic Value Propagation (MVP), a model-based algorithm proposed by \cite{zhang2020reinforcement}, achieves a regret on the order of (modulo log factors) \begin{equation*} \min\big\{ \sqrt{SAH^3K}, \,HK \big\}, \end{equation*} where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, and $K$ is the total number of episodes. This regret matches the minimax lower bound for the entire range of sample sizes $K\geq 1$, essentially eliminating any burn-in requirement. It also translates to a PAC sample complexity (i.e., the number of episodes needed to yield $\varepsilon$-accuracy) of $\frac{SAH^3}{\varepsilon^2}$ up to log factors, which is minimax-optimal for the full $\varepsilon$-range. Further, we extend our theory to unveil the influence of problem-dependent quantities such as the optimal value/cost and certain variances. The key technical innovation lies in the development of a new regret decomposition strategy and a novel analysis paradigm to decouple complicated statistical dependencies -- a long-standing challenge facing the analysis of online RL in the sample-hungry regime.
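For concreteness, the regret-to-PAC translation follows the standard online-to-batch argument sketched below (log factors and constants suppressed; the initial state $s_1$ is taken to be fixed, and the value notation $V_1^{\pi}$ together with the uniform-mixture output policy $\widehat{\pi}$ are introduced here purely for illustration and need not match the paper's exact conversion). If the executed policies $\pi^1,\dots,\pi^K$ satisfy
\begin{equation*}
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K}\Big(V_1^{\star}(s_1)-V_1^{\pi^k}(s_1)\Big) \;\lesssim\; \sqrt{SAH^3K},
\end{equation*}
then a policy $\widehat{\pi}$ drawn uniformly at random from $\{\pi^1,\dots,\pi^K\}$ obeys
\begin{equation*}
\mathbb{E}\big[V_1^{\star}(s_1)-V_1^{\widehat{\pi}}(s_1)\big] \;=\; \frac{1}{K}\,\mathrm{Regret}(K) \;\lesssim\; \sqrt{\frac{SAH^3}{K}},
\end{equation*}
so demanding that the right-hand side be at most $\varepsilon$ requires $K \gtrsim \frac{SAH^3}{\varepsilon^2}$ episodes, which is the PAC sample complexity quoted above.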

[1]  Yuxin Chen,et al.  The Curious Price of Distributional Robustness in Reinforcement Learning with a Generative Model , 2023, ArXiv.

[2]  Wen Sun,et al.  The Benefits of Being Distributional: Small-Loss Bounds for Reinforcement Learning , 2023, ArXiv.

[3]  Gen Li,et al.  Regret-Optimal Model-Free Reinforcement Learning for Discounted MDPs with Short Burn-In Time , 2023, ArXiv.

[4]  Yuxin Chen,et al.  Minimax-Optimal Reward-Agnostic Exploration in Reinforcement Learning , 2023, ArXiv.

[5]  Quanquan Gu,et al.  Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency , 2023, COLT.

[6]  S. Du,et al.  Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both Worlds in Stochastic and Deterministic Environments , 2023, ICML.

[7]  Yuxin Chen,et al.  Minimax-Optimal Multi-Agent RL in Markov Games With a Generative Model , 2022, NeurIPS.

[8]  S. Du,et al.  On Gap-dependent Bounds for Offline Reinforcement Learning , 2022, NeurIPS.

[9]  Yuxin Chen,et al.  Settling the Sample Complexity of Model-Based Offline Reinforcement Learning , 2022, ArXiv.

[10]  S. Du,et al.  Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies , 2022, COLT.

[11]  Jianqing Fan,et al.  The Efficacy of Pessimism in Asynchronous Q-Learning , 2022, IEEE Transactions on Information Theory.

[12]  Yu-Xiang Wang,et al.  Near-optimal Offline Reinforcement Learning with Linear Representation: Leveraging Variance Information with Pessimism , 2022, ICLR.

[13]  Yuxin Chen,et al.  Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity , 2022, ICML.

[14]  Kevin G. Jamieson,et al.  First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach , 2021, ICML.

[15]  Yuanzhi Li,et al.  Settling the Horizon-Dependence of Sample Complexity in Reinforcement Learning , 2021, 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS).

[16]  Yuxin Chen,et al.  Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning , 2021, NeurIPS.

[17]  Julian Zimmert,et al.  Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforcement Learning , 2021, NeurIPS.

[18]  Alessandro Lazaric,et al.  A Fully Problem-Dependent Regret Lower Bound for Finite-Horizon MDPs , 2021, ArXiv.

[19]  Haipeng Luo,et al.  Implicit Finite-Horizon Approximation and Efficient Optimal Algorithms for Stochastic Shortest Path , 2021, NeurIPS.

[20]  Caiming Xiong,et al.  Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning , 2021, NeurIPS.

[21]  Alessandro Lazaric,et al.  Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret , 2021, NeurIPS.

[22]  S. Du,et al.  Nearly Horizon-Free Offline Reinforcement Learning , 2021, NeurIPS.

[23]  Stuart J. Russell,et al.  Bridging Offline Reinforcement Learning and Imitation Learning: A Tale of Pessimism , 2021, IEEE Transactions on Information Theory.

[24]  Michal Valko,et al.  UCB Momentum Q-learning: Correcting the bias without forgetting , 2021, ICML.

[25]  Simon S. Du,et al.  Near-Optimal Randomized Exploration for Tabular Markov Decision Processes , 2021, NeurIPS.

[26]  Gen Li,et al.  Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis , 2021, Operations Research.

[27]  Tengyu Ma,et al.  Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap , 2021, COLT.

[28]  Martin J. Wainwright,et al.  Instance-Dependent ℓ∞-Bounds for Policy Evaluation in Tabular Reinforcement Learning , 2021, IEEE Transactions on Information Theory.

[29]  Zhuoran Yang,et al.  Is Pessimism Provably Efficient for Offline RL? , 2020, ICML.

[30]  Lin F. Yang,et al.  Minimax Sample Complexity for Turn-based Stochastic Game , 2020, UAI.

[31]  Michal Valko,et al.  Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited , 2020, ALT.

[32]  S. Du,et al.  Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal Algorithm Escaping the Curse of Horizon , 2020, COLT.

[33]  Gergely Neu,et al.  A Unifying View of Optimism in Episodic Reinforcement Learning , 2020, NeurIPS.

[34]  Lin F. Yang,et al.  Q-learning with Logarithmic Regret , 2020, AISTATS.

[35]  Haipeng Luo,et al.  Bias no more: high-probability data-dependent regret bounds for adversarial bandits and MDPs , 2020, NeurIPS.

[36]  Yuxin Chen,et al.  Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction , 2020, IEEE Transactions on Information Theory.

[37]  Yuxin Chen,et al.  Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model , 2020, NeurIPS.

[38]  S. Levine,et al.  Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems , 2020, ArXiv.

[39]  Lin F. Yang,et al.  Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning? , 2020, ArXiv.

[40]  Xiangyang Ji,et al.  Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition , 2020, NeurIPS.

[41]  Akshay Krishnamurthy,et al.  Reward-Free Exploration for Reinforcement Learning , 2020, ICML.

[42]  Siva Theja Maguluri,et al.  Finite-Sample Analysis of Contractive Stochastic Approximation Using Smooth Convex Envelopes , 2020, NeurIPS.

[43]  Adam Wierman,et al.  Finite-Time Analysis of Asynchronous Stochastic Approximation and Q-Learning , 2020, COLT.

[44]  Chi Jin,et al.  Provably Efficient Exploration in Policy Optimization , 2019, ICML.

[45]  Martin J. Wainwright,et al.  Variance-reduced Q-learning is minimax optimal , 2019, ArXiv.

[46]  Lin F. Yang,et al.  Model-Based Reinforcement Learning with a Generative Model is Minimax Optimal , 2019, COLT.

[47]  Daniel Russo,et al.  Worst-Case Regret Bounds for Exploration via Randomized Value Functions , 2019, NeurIPS.

[48]  Yu Bai,et al.  Provably Efficient Q-Learning with Low Switching Cost , 2019, NeurIPS.

[49]  Martin J. Wainwright,et al.  Stochastic approximation with cone-contractive operators: Sharp $\ell_\infty$-bounds for $Q$-learning , 2019, ArXiv.

[50]  Shie Mannor,et al.  Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies , 2019, NeurIPS.

[51]  Max Simchowitz,et al.  Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs , 2019, NeurIPS.

[52]  Xiaoyu Chen,et al.  Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP , 2019, ICLR.

[53]  Emma Brunskill,et al.  Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds , 2019, ICML.

[54]  Lihong Li,et al.  Policy Certificates: Towards Accountable Reinforcement Learning , 2018, ICML.

[55]  Michael I. Jordan,et al.  Is Q-learning Provably Efficient? , 2018, NeurIPS.

[56]  Nan Jiang,et al.  Open Problem: The Dependence of Sample Complexity Lower Bounds on Planning Horizon , 2018, COLT.

[57]  Mohammad Sadegh Talebi,et al.  Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs , 2018, ALT.

[58]  Alessandro Lazaric,et al.  Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning , 2018, ICML.

[59]  Yuanzhi Li,et al.  Make the Minority Great Again: First-Order Regret Bound for Contextual Bandits , 2018, ICML.

[60]  Xian Wu,et al.  Variance reduced value iteration and faster algorithms for solving Markov decision processes , 2017, SODA.

[61]  John Langford,et al.  Open Problem: First-Order Regret Bounds for Contextual Bandits , 2017, COLT.

[62]  Shipra Agrawal,et al.  Optimistic posterior sampling for reinforcement learning: worst-case regret bounds , 2017, NIPS.

[63]  Rémi Munos,et al.  Minimax Regret Bounds for Reinforcement Learning , 2017, ICML.

[64]  Tor Lattimore,et al.  Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning , 2017, NIPS.

[65]  Christoph Dann,et al.  Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning , 2015, NIPS.

[66]  Benjamin Van Roy,et al.  (More) Efficient Reinforcement Learning via Posterior Sampling , 2013, NIPS.

[67]  R. Srikant,et al.  Error bounds for constant step-size Q-learning , 2012, Syst. Control. Lett..

[68]  Hilbert J. Kappen,et al.  On the Sample Complexity of Reinforcement Learning with a Generative Model , 2012, ICML.

[69]  Tor Lattimore,et al.  PAC Bounds for Discounted MDPs , 2012, ALT.

[70]  Csaba Szepesvári,et al.  Model-based reinforcement learning with nearly tight exploration complexity bounds , 2010, ICML.

[71]  Massimiliano Pontil,et al.  Empirical Bernstein Bounds and Sample-Variance Penalization , 2009, COLT.

[72]  Ambuj Tewari,et al.  REGAL: A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs , 2009, UAI.

[73]  Andrew Y. Ng,et al.  Near-Bayesian exploration in polynomial time , 2009, ICML.

[74]  Peter Auer,et al.  Near-optimal Regret Bounds for Reinforcement Learning , 2008, J. Mach. Learn. Res..

[75]  Michael L. Littman,et al.  An analysis of model-based Interval Estimation for Markov Decision Processes , 2008, J. Comput. Syst. Sci..

[76]  Lihong Li,et al.  PAC model-free reinforcement learning , 2006, ICML.

[77]  Yishay Mansour,et al.  Learning Rates for Q-learning , 2004, J. Mach. Learn. Res..

[78]  Ronen I. Brafman,et al.  R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning , 2001, J. Mach. Learn. Res..

[79]  Michael Kearns,et al.  Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms , 1998, NIPS.

[80]  Michael Kearns,et al.  Near-Optimal Reinforcement Learning in Polynomial Time , 2002, Machine Learning.

[82]  D. Bertsekas.  Reinforcement Learning and Optimal Control: A Selective Overview , 2018.

[83]  Xian Wu,et al.  Near-Optimal Time and Sample Complexities for Solving Markov Decision Processes with a Generative Model , 2018, NeurIPS.

[84]  Sham M. Kakade.  On the sample complexity of reinforcement learning , 2003, PhD thesis.