Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies

This paper gives the first polynomial-time algorithm for tabular Markov Decision Processes (MDPs) that enjoys a regret bound independent of the planning horizon. Specifically, we consider a tabular MDP with S states, A actions, planning horizon H, and total reward bounded by 1, where the agent plays for K episodes. We design an algorithm that achieves an $O\big(\mathrm{poly}(S, A, \log K)\sqrt{K}\big)$ regret, in contrast to existing bounds which either have an additional $\mathrm{polylog}(H)$ dependency (Zhang et al., 2021b) or an exponential dependency on S (Li et al., 2021b). Our result relies on a sequence of new structural lemmas establishing the approximation power, stability, and concentration properties of stationary policies, which can have applications in other problems related to Markov chains.
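For concreteness, the regret criterion in this episodic setting can be written as follows; this is a standard formalization, and the notation ($\pi_k$, $V_1^{\star}$, $s_1^k$) is assumed here rather than quoted from the paper. With rewards normalized so that $\sum_{h=1}^{H} r_h \le 1$ along every trajectory,

\[
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K}\Big(V_1^{\star}(s_1^k) - V_1^{\pi_k}(s_1^k)\Big),
\]

where $\pi_k$ is the policy played in episode $k$ and $s_1^k$ is that episode's initial state. The paper's guarantee bounds this quantity by $O\big(\mathrm{poly}(S, A, \log K)\,\sqrt{K}\big)$, with no dependence on the horizon H.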

[1] Rahul Jain, et al. Improved No-Regret Algorithms for Stochastic Shortest Path with Linear MDP, 2021, ICML.

[2] Yuanzhi Li, et al. Settling the Horizon-Dependence of Sample Complexity in Reinforcement Learning, 2021, FOCS.

[3] Yuxin Chen, et al. Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning, 2021, NeurIPS.

[4] Haipeng Luo, et al. Implicit Finite-Horizon Approximation and Efficient Optimal Algorithms for Stochastic Shortest Path, 2021, NeurIPS.

[5] Alessandro Lazaric, et al. Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret, 2021, NeurIPS.

[6] S. Du, et al. Nearly Horizon-Free Offline Reinforcement Learning, 2021, NeurIPS.

[7] Michal Valko, et al. UCB Momentum Q-learning: Correcting the bias without forgetting, 2021, ICML.

[8] Tengyu Ma, et al. Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap, 2021, COLT.

[9] S. Du, et al. Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal Algorithm Escaping the Curse of Horizon, 2020, COLT.

[10] Lin F. Yang, et al. Q-learning with Logarithmic Regret, 2020, AISTATS.

[11] Xiangyang Ji, et al. Variance-Aware Confidence Set: Variance-Dependent Bound for Linear Bandits and Horizon-Free Bound for Linear Mixture MDP, 2021, arXiv.

[12] S. Du, et al. Randomized Exploration is Near-Optimal for Tabular MDP, 2021, arXiv.

[13] Xiangyang Ji, et al. Nearly Minimax Optimal Reward-free Reinforcement Learning, 2020, arXiv.

[14] Gergely Neu, et al. A Unifying View of Optimism in Episodic Reinforcement Learning, 2020, NeurIPS.

[15] Krzysztof Choromanski, et al. On Optimism in Model-Based Reinforcement Learning, 2020, arXiv.

[16] Lin F. Yang, et al. Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning?, 2020, arXiv.

[17] Xiangyang Ji, et al. Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition, 2020, NeurIPS.

[18] Chi Jin, et al. Provably Efficient Exploration in Policy Optimization, 2019, ICML.

[19] Xiaoyu Chen, et al. Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP, 2019, ICLR.

[20] Xiangyang Ji, et al. Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function, 2019, NeurIPS.

[21] Daniel Russo, et al. Worst-Case Regret Bounds for Exploration via Randomized Value Functions, 2019, NeurIPS.

[22] Max Simchowitz, et al. Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs, 2019, NeurIPS.

[23] Emma Brunskill, et al. Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds, 2019, ICML.

[24] Lihong Li, et al. Policy Certificates: Towards Accountable Reinforcement Learning, 2018, ICML.

[25] Michael I. Jordan, et al. Is Q-learning Provably Efficient?, 2018, NeurIPS.

[26] Alessandro Lazaric, et al. Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes, 2018, NeurIPS.

[27] Nan Jiang, et al. Open Problem: The Dependence of Sample Complexity Lower Bounds on Planning Horizon, 2018, COLT.

[28] Mohammad Sadegh Talebi, et al. Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs, 2018, ALT.

[29] Shipra Agrawal, et al. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds, 2017, NIPS.

[30] Rémi Munos, et al. Minimax Regret Bounds for Reinforcement Learning, 2017, ICML.

[31] Tor Lattimore, et al. Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning, 2017, NIPS.

[32] Benjamin Van Roy, et al. Why is Posterior Sampling Better than Optimism for Reinforcement Learning?, 2016, ICML.

[33] Christoph Dann, et al. Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning, 2015, NIPS.

[34] Benjamin Van Roy, et al. (More) Efficient Reinforcement Learning via Posterior Sampling, 2013, NIPS.

[35] Tor Lattimore, et al. PAC Bounds for Discounted MDPs, 2012, ALT.

[36] Csaba Szepesvári, et al. Model-based reinforcement learning with nearly tight exploration complexity bounds, 2010, ICML.

[37] Ambuj Tewari, et al. REGAL: A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs, 2009, UAI.

[38] Andrew Y. Ng, et al. Near-Bayesian exploration in polynomial time, 2009, ICML.

[39] Peter Auer, et al. Near-optimal Regret Bounds for Reinforcement Learning, 2008, J. Mach. Learn. Res.

[40] Michael L. Littman, et al. An analysis of model-based Interval Estimation for Markov Decision Processes, 2008, J. Comput. Syst. Sci.

[41] Michael Kearns, et al. Near-Optimal Reinforcement Learning in Polynomial Time, 2002, Machine Learning.

[42] Sham M. Kakade, et al. On the sample complexity of reinforcement learning, 2003.

[43] Ronen I. Brafman, et al. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning, 2001, J. Mach. Learn. Res.

[44] Shumeet Baluja, et al. Advances in Neural Information Processing, 1994.

[45] D. Freedman. On Tail Probabilities for Martingales, 1975.