Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret

We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state. We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus to guarantee both optimism and convergence of the associated value iteration scheme. We prove that EB-SSP achieves the minimax regret rate $\widetilde{O}(B_{\star} \sqrt{S A K})$, where $K$ is the number of episodes, $S$ is the number of states, $A$ is the number of actions and $B_{\star}$ bounds the expected cumulative cost of the optimal policy from any state, thus closing the gap with the lower bound. Interestingly, EB-SSP obtains this result while being parameter-free, i.e., it does not require any prior knowledge of $B_{\star}$, nor of $T_{\star}$ which bounds the expected time-to-goal of the optimal policy from any state. Furthermore, we illustrate various cases (e.g., positive costs, or general costs when an order-accurate estimate of $T_{\star}$ is available) where the regret only contains a logarithmic dependence on $T_{\star}$, thus yielding the first horizon-free regret bound beyond the finite-horizon MDP setting.
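
The abstract compresses two algorithmic ideas: (i) skewing the empirical transitions so that a small amount of probability mass is redirected to the goal state, which makes the perturbed model proper and lets the associated value iteration converge, and (ii) subtracting an exploration bonus from the empirical costs so the resulting value function is optimistic. The snippet below is a minimal sketch of how such a skewed-and-bonused value iteration could look; it is an illustration under assumed forms, not the authors' exact EB-SSP. In particular, the skewing parameter `eta`, the Hoeffding-style bonus (the paper uses a sharper variance-aware bonus), and the scale `B_star_guess` are all assumptions introduced here for concreteness.

```python
import numpy as np

def eb_ssp_value_iteration(P_hat, c_hat, n_visits, B_star_guess=1.0,
                           eta=1e-3, tol=1e-6, max_iter=10_000):
    """Sketch of value iteration on a skewed, optimistic empirical SSP model.

    P_hat    : (S, A, S+1) empirical transition probabilities; index S is the goal.
    c_hat    : (S, A) empirical mean costs in [0, 1].
    n_visits : (S, A) visit counts used to size the exploration bonus.
    """
    S, A, _ = P_hat.shape

    # (1) Skew transitions: shrink every row by (1 - eta) and move the
    # leftover mass eta to the goal state (column index S). Every policy
    # then reaches the goal with probability >= eta at each step, so the
    # perturbed model is proper and value iteration converges.
    P_tilde = (1.0 - eta) * P_hat
    P_tilde[:, :, S] += eta

    # (2) Optimistic costs: subtract a simple Hoeffding-style bonus
    # (assumed form; the paper's bonus is variance-aware) and clip at 0.
    bonus = B_star_guess * np.sqrt(1.0 / np.maximum(n_visits, 1))
    c_tilde = np.clip(c_hat - bonus, 0.0, None)

    # Value iteration on the skewed, optimistic model; V is zero at the goal.
    V = np.zeros(S + 1)
    for _ in range(max_iter):
        Q = c_tilde + P_tilde @ V                      # (S, A) state-action values
        V_new = np.concatenate([Q.min(axis=1), [0.0]]) # greedy backup, goal pinned at 0
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V[:S], Q.argmin(axis=1)

if __name__ == "__main__":
    # Toy usage on a random empirical model (hypothetical data, for illustration).
    rng = np.random.default_rng(0)
    S, A = 4, 2
    P = rng.dirichlet(np.ones(S + 1), size=(S, A))  # rows sum to 1, goal = column S
    c = rng.uniform(0.1, 1.0, size=(S, A))
    n = rng.integers(1, 50, size=(S, A))
    V, pi = eb_ssp_value_iteration(P, c, n)
    print("optimistic values:", np.round(V, 3), "| greedy policy:", pi)
```

In this sketch, the transition skewing is what guarantees convergence of the value iteration scheme, while the cost perturbation is what delivers optimism (the computed values lower-bound the optimal ones with high probability, for an appropriately sized bonus).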
