Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies

This paper gives the first polynomial-time algorithm for tabular Markov Decision Processes (MDPs) that enjoys a regret bound independent of the planning horizon. Specifically, we consider a tabular MDP with S states, A actions, planning horizon H, and total reward bounded by 1, where the agent plays for K episodes. We design an algorithm that achieves an $O\big(\mathrm{poly}(S, A, \log K)\sqrt{K}\big)$ regret, in contrast to existing bounds which either have an additional $\mathrm{polylog}(H)$ dependency (Zhang et al., 2021b) or an exponential dependency on S (Li et al., 2021b). Our result relies on a sequence of new structural lemmas establishing the approximation power, stability, and concentration properties of stationary policies, which can have applications in other problems related to Markov chains.
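For concreteness, the regret criterion in this episodic setting can be written as follows; this is a standard formalization, and the notation ($\pi_k$, $V_1^{\star}$, $s_1^k$) is assumed here rather than quoted from the paper. With rewards normalized so that $\sum_{h=1}^{H} r_h \le 1$ along every trajectory,

\[
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K}\Big(V_1^{\star}(s_1^k) - V_1^{\pi_k}(s_1^k)\Big),
\]

where $\pi_k$ is the policy played in episode $k$ and $s_1^k$ is that episode's initial state. The paper's guarantee bounds this quantity by $O\big(\mathrm{poly}(S, A, \log K)\,\sqrt{K}\big)$, with no dependence on the horizon H.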

[1] Rahul Jain, et al. Improved No-Regret Algorithms for Stochastic Shortest Path with Linear MDP, 2021, ICML.

[2] Yuanzhi Li, et al. Settling the Horizon-Dependence of Sample Complexity in Reinforcement Learning, 2021, FOCS.

[3] Yuxin Chen, et al. Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning, 2021, NeurIPS.

[4] Haipeng Luo, et al. Implicit Finite-Horizon Approximation and Efficient Optimal Algorithms for Stochastic Shortest Path, 2021, NeurIPS.

[5] Alessandro Lazaric, et al. Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret, 2021, NeurIPS.

[6] S. Du, et al. Nearly Horizon-Free Offline Reinforcement Learning, 2021, NeurIPS.

[7] Michal Valko, et al. UCB Momentum Q-learning: Correcting the bias without forgetting, 2021, ICML.

[8] Tengyu Ma, et al. Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap, 2021, COLT.

[9] S. Du, et al. Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal Algorithm Escaping the Curse of Horizon, 2020, COLT.

[10] Lin F. Yang, et al. Q-learning with Logarithmic Regret, 2020, AISTATS.

[11] Xiangyang Ji, et al. Variance-Aware Confidence Set: Variance-Dependent Bound for Linear Bandits and Horizon-Free Bound for Linear Mixture MDP, 2021, arXiv.

[12] S. Du, et al. Randomized Exploration is Near-Optimal for Tabular MDP, 2021, arXiv.

[13] Xiangyang Ji, et al. Nearly Minimax Optimal Reward-free Reinforcement Learning, 2020, arXiv.

[14] Gergely Neu, et al. A Unifying View of Optimism in Episodic Reinforcement Learning, 2020, NeurIPS.

[15] Krzysztof Choromanski, et al. On Optimism in Model-Based Reinforcement Learning, 2020, arXiv.

[16] Lin F. Yang, et al. Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning?, 2020, arXiv.

[17] Xiangyang Ji, et al. Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition, 2020, NeurIPS.

[18] Chi Jin, et al. Provably Efficient Exploration in Policy Optimization, 2019, ICML.

[19] Xiaoyu Chen, et al. Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP, 2019, ICLR.

[20] Xiangyang Ji, et al. Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function, 2019, NeurIPS.

[21] Daniel Russo, et al. Worst-Case Regret Bounds for Exploration via Randomized Value Functions, 2019, NeurIPS.

[22] Max Simchowitz, et al. Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs, 2019, NeurIPS.

[23] Emma Brunskill, et al. Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds, 2019, ICML.

[24] Lihong Li, et al. Policy Certificates: Towards Accountable Reinforcement Learning, 2018, ICML.

[25] Michael I. Jordan, et al. Is Q-learning Provably Efficient?, 2018, NeurIPS.

[26] Alessandro Lazaric, et al. Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes, 2018, NeurIPS.

[27] Nan Jiang, et al. Open Problem: The Dependence of Sample Complexity Lower Bounds on Planning Horizon, 2018, COLT.

[28] Mohammad Sadegh Talebi, et al. Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs, 2018, ALT.

[29] Shipra Agrawal, et al. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds, 2017, NIPS.

[30] Rémi Munos, et al. Minimax Regret Bounds for Reinforcement Learning, 2017, ICML.

[31] Tor Lattimore, et al. Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning, 2017, NIPS.

[32] Benjamin Van Roy, et al. Why is Posterior Sampling Better than Optimism for Reinforcement Learning?, 2016, ICML.

[33] Christoph Dann, et al. Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning, 2015, NIPS.

[34] Benjamin Van Roy, et al. (More) Efficient Reinforcement Learning via Posterior Sampling, 2013, NIPS.

[35] Tor Lattimore, et al. PAC Bounds for Discounted MDPs, 2012, ALT.

[36] Csaba Szepesvári, et al. Model-based reinforcement learning with nearly tight exploration complexity bounds, 2010, ICML.

[37] Ambuj Tewari, et al. REGAL: A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs, 2009, UAI.

[38] Andrew Y. Ng, et al. Near-Bayesian exploration in polynomial time, 2009, ICML.

[39] Peter Auer, et al. Near-optimal Regret Bounds for Reinforcement Learning, 2008, J. Mach. Learn. Res.

[40] Michael L. Littman, et al. An analysis of model-based Interval Estimation for Markov Decision Processes, 2008, J. Comput. Syst. Sci.

[41] Michael Kearns, et al. Near-Optimal Reinforcement Learning in Polynomial Time, 2002, Machine Learning.

[42] Sham M. Kakade, et al. On the sample complexity of reinforcement learning, 2003.

[43] Ronen I. Brafman, et al. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning, 2001, J. Mach. Learn. Res.

[44] Shumeet Baluja, et al. Advances in Neural Information Processing, 1994.

[45] D. Freedman. On Tail Probabilities for Martingales, 1975.