Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition

We study the stochastic shortest path problem with adversarial costs and known transition, and show that the minimax regret is $\widetilde{O}(\sqrt{DT^\star K})$ for the full-information setting and $\widetilde{O}(\sqrt{DT^\star SAK})$ for the bandit feedback setting, where $D$ is the diameter, $T^\star$ is the expected hitting time of the optimal policy, $S$ is the number of states, $A$ is the number of actions, and $K$ is the number of episodes. Our results significantly improve upon the existing work of Rosenberg and Mansour (2020), which considers only the full-information setting and achieves suboptimal regret. Ours is also the first work to consider bandit feedback with adversarial costs. Our algorithms are built on top of the Online Mirror Descent framework with a variety of new techniques that might be of independent interest, including an improved multi-scale expert algorithm, a reduction from general stochastic shortest path to a special loop-free case, a skewed occupancy measure space, and a novel correction term added to the cost estimators. Interestingly, the last two elements reduce the variance of the learner via positive bias and the variance of the optimal policy via negative bias respectively, and having them simultaneously is critical for obtaining the optimal high-probability bound in the bandit feedback setting.
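
As background for the framework named above (and not the paper's exact algorithm), the following is a minimal sketch of the generic Online Mirror Descent template over occupancy measures, in the spirit of the relative entropy policy search approach of [26]. With a known transition function, every policy $\pi$ induces an occupancy measure $q^\pi(s,a)$, the expected number of visits to the state-action pair $(s,a)$ before reaching the goal, so the expected cost of $\pi$ in episode $k$ is the linear function $\langle q^\pi, c_k \rangle = \sum_{s,a} q^\pi(s,a)\, c_k(s,a)$. Given a convex regularizer $\psi$, a learning rate $\eta > 0$, and a cost estimator $\widehat{c}_k$ (equal to $c_k$ under full information, and an importance-weighted estimate under bandit feedback), OMD plays the policy induced by $q_k$ and updates
$$
q_{k+1} = \operatorname*{argmin}_{q \in \Omega} \; \eta \langle q, \widehat{c}_k \rangle + D_{\psi}(q, q_k),
$$
where $\Omega$ is the convex polytope of valid occupancy measures and $D_{\psi}$ is the Bregman divergence induced by $\psi$. The regret after $K$ episodes against the best fixed policy $\pi^\star$ is then $\sum_{k=1}^{K} \langle q_k - q^{\pi^\star}, c_k \rangle$. The paper's actual algorithms build on this template with the skewed measure space and the correction term described above.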

[1] Haipeng Luo et al. Finding the Stochastic Shortest Path with Low Regret: The Adversarial Cost and Unknown Transition Case. ICML, 2021.

[2] Liyu Chen et al. Impossible Tuning Made Possible: A New Expert Algorithm and Its Applications. COLT, 2021.

[3] Haipeng Luo et al. Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition. ICML, 2020.

[4] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

[5] Aviv Rosenberg and Yishay Mansour. Adversarial Stochastic Shortest Path. arXiv, 2020.

[6] Haipeng Luo et al. Bias no more: high-probability data-dependent regret bounds for adversarial bandits and MDPs. NeurIPS, 2020.

[7] Haim Kaplan et al. Near-optimal Regret Bounds for Stochastic Shortest Path. ICML, 2020.

[8] Shie Mannor et al. Optimistic Policy Optimization with Bandit Feedback. ICML, 2020.

[9] Haipeng Luo et al. A Closer Look at Small-loss Bounds for Bandits with Graph Feedback. COLT, 2020.

[10] Alessandro Lazaric et al. No-Regret Exploration in Goal-Oriented Reinforcement Learning. ICML, 2019.

[11] Yishay Mansour et al. Online Convex Optimization in Adversarial Markov Decision Processes. ICML, 2019.

[12] Wojciech Kotlowski et al. Bandit Principal Component Analysis. COLT, 2019.

[13] Haipeng Luo et al. Improved Path-length Regret Bounds for Bandits. COLT, 2019.

[14] Emma Brunskill et al. Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds. ICML, 2019.

[15] Michael I. Jordan et al. Is Q-learning Provably Efficient? NeurIPS, 2018.

[16] Haipeng Luo et al. Efficient Online Portfolio with Logarithmic Regret. NeurIPS, 2018.

[17] Francesco Orabona et al. Black-Box Reductions for Parameter-free Online Learning in Banach Spaces. COLT, 2018.

[18] Haipeng Luo et al. More Adaptive Algorithms for Adversarial Bandits. COLT, 2018.

[19] Mehryar Mohri et al. Parameter-Free Online Learning via Model Selection. NIPS, 2017.

[20] Nikhil R. Devanur et al. Online Auctions and Multi-scale Online Learning. EC, 2017.

[21] Rémi Munos et al. Minimax Regret Bounds for Reinforcement Learning. ICML, 2017.

[22] Haipeng Luo et al. Corralling a Band of Bandit Algorithms. COLT, 2016.

[23] Éva Tardos et al. Learning in Games: Robustness of Fast Convergence. NIPS, 2016.

[24] Tor Lattimore et al. Refined Lower Bounds for Adversarial Bandits. NIPS, 2016.

[25] Percy Liang et al. Adaptivity and Optimism: An Improved Exponentiated Gradient Algorithm. ICML, 2014.

[26] Gergely Neu et al. Online learning in episodic Markovian decision processes by relative entropy policy search. NIPS, 2013.

[27] Dimitri P. Bertsekas et al. Stochastic Shortest Path Problems Under Weak Conditions. 2013.

[28] András György et al. The adversarial stochastic shortest path problem with unknown transition probabilities. AISTATS, 2012.

[29] John Langford et al. Contextual Bandit Algorithms with Supervised Learning Guarantees. AISTATS, 2010.

[30] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. EuroCOLT, 1997.

[31] Dimitri P. Bertsekas and John N. Tsitsiklis. An Analysis of Stochastic Shortest Path Problems. Math. Oper. Res., 1991.