Refined Regret for Adversarial MDPs with Linear Function Approximation

We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes and the state space can be arbitrarily large. We assume that the Q-function of any policy is linear in some known features, that is, a linear function approximation exists. The best existing regret upper bound for this setting (Luo et al., 2021) is of order $\tilde{\mathcal O}(K^{2/3})$ (omitting all other dependencies), given access to a simulator. This paper provides two algorithms that improve the regret to $\tilde{\mathcal O}(\sqrt K)$ in the same setting. Our first algorithm relies on a refined analysis of the Follow-the-Regularized-Leader (FTRL) algorithm with the log-barrier regularizer. This analysis allows the loss estimators to be arbitrarily negative and might be of independent interest. Our second algorithm develops a magnitude-reduced loss estimator, further removing the polynomial dependency on the number of actions incurred by the first algorithm and leading to the optimal regret bound (up to logarithmic terms and dependency on the horizon). Moreover, we extend the first algorithm to simulator-free linear MDPs, where it achieves $\tilde{\mathcal O}(K^{8/9})$ regret and greatly improves over the best existing bound of $\tilde{\mathcal O}(K^{14/15})$. This algorithm relies on a better alternative to the Matrix Geometric Resampling procedure of Neu & Olkhovskaya (2020), which could again be of independent interest.
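As a point of reference for the first algorithm's main ingredient, the following is a minimal sketch of FTRL with the log-barrier regularizer over a finite action set. The solver (bisection on the simplex multiplier), the step size, and the importance-weighted estimator in the demo are illustrative assumptions rather than the paper's implementation, which applies this kind of update with loss estimators built from the linear features.

```python
import numpy as np

def ftrl_log_barrier_update(cum_loss_est, eta, tol=1e-10, max_iter=200):
    """One FTRL step with the log-barrier regularizer over the simplex.

    Solves  x = argmin_{x in simplex}  <x, L_hat> + (1/eta) * sum_a -log(x_a).
    The KKT conditions give x_a = 1 / (eta * (L_hat_a + lam)) for the unique
    multiplier lam > -min_a L_hat_a that makes the coordinates sum to one;
    lam is found here by bisection.
    """
    L = np.asarray(cum_loss_est, dtype=float)
    # lam must exceed -min(L) so that every coordinate is positive; this holds
    # regardless of how negative the cumulative loss estimates are.
    lo = -L.min() + 1e-12
    hi = lo + 1.0
    # Grow the upper bracket until the induced point sums to at most one.
    while np.sum(1.0 / (eta * (L + hi))) > 1.0:
        hi = lo + 2.0 * (hi - lo)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        s = np.sum(1.0 / (eta * (L + mid)))
        if abs(s - 1.0) < tol:
            break
        if s > 1.0:
            lo = mid   # multiplier too small: point sums to more than one
        else:
            hi = mid
    x = 1.0 / (eta * (L + mid))
    return x / x.sum()  # tiny renormalization for numerical safety

if __name__ == "__main__":
    # Toy bandit-style demo with a standard importance-weighted estimator
    # (hypothetical parameters, for illustration only).
    rng = np.random.default_rng(0)
    num_actions, eta = 5, 0.1
    cum_est = np.zeros(num_actions)
    for t in range(100):
        policy = ftrl_log_barrier_update(cum_est, eta)
        action = rng.choice(num_actions, p=policy)
        loss = rng.uniform(0.0, 1.0)
        est = np.zeros(num_actions)
        est[action] = loss / policy[action]
        cum_est += est
    print(np.round(policy, 3))
```

Note that the closed-form coordinates $x_a = 1/(\eta(\hat L_a + \lambda))$ remain well-defined however negative the cumulative estimates are, which is the regime the refined log-barrier analysis described above allows for.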

[1] Aviv A. Rosenberg, et al. Delay-Adapted Policy Optimization and Improved Regret for Adversarial MDP with Delayed Bandit Feedback, 2023, ICML.

[2] Shuai Li, et al. Improved Regret Bounds for Linear Adversarial MDPs via Linear Optimization, 2023, Trans. Mach. Learn. Res.

[3] Y. Mansour, et al. Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation, 2023, ICML.

[4] Kevin G. Jamieson, et al. Instance-Dependent Near-Optimal Policy Identification in Linear MDPs via Online Experiment Design, 2022, NeurIPS.

[5] Chen-Yu Wei, et al. Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses, 2021, NeurIPS.

[6] Shipra Agrawal, et al. Scale Free Adversarial Multi Armed Bandits, 2021, ALT.

[7] Alekh Agarwal, et al. Cautiously Optimistic Policy Optimization and Exploration with Linear Function Approximation, 2021, COLT.

[8] Quanquan Gu, et al. Near-optimal Policy Optimization Algorithms for Learning Adversarial Linear Mixture MDPs, 2021, AISTATS.

[9] Michael I. Jordan, et al. Provably Efficient Reinforcement Learning with Linear Function Approximation Under Adaptivity Constraints, 2021, NeurIPS.

[10] Haipeng Luo, et al. Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation, 2020, AISTATS.

[11] Wen Sun, et al. PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning, 2020, NeurIPS.

[12] Csaba Szepesvari, et al. Bandit Algorithms, 2020.

[13] Gergely Neu, et al. Online learning in MDPs with linear function approximation and bandit feedback, 2020, NeurIPS.

[14] Quanquan Gu, et al. Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping, 2020, ICML.

[15] Ruosong Wang, et al. On Reward-Free Reinforcement Learning with Linear Function Approximation, 2020, NeurIPS.

[16] Shie Mannor, et al. Optimistic Policy Optimization with Bandit Feedback, 2020, ICML.

[17] Gergely Neu, et al. Efficient and Robust Algorithms for Adversarial Linear Contextual Bandits, 2020, COLT.

[18] Chi Jin, et al. Provably Efficient Exploration in Policy Optimization, 2019, ICML.

[19] Chi Jin, et al. Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition, 2019, ICML.

[20] Haipeng Luo, et al. Equipping Experts/Bandits with Long-term Memory, 2019, NeurIPS.

[21] Mengdi Wang, et al. Reinforcement Learning in Feature Space: Matrix Bandit, Kernels, and Regret Bound, 2019, ICML.

[22] Peter L. Bartlett, et al. POLITEX: Regret Bounds for Policy Iteration using Expert Prediction, 2019, ICML.

[23] Yishay Mansour, et al. Online Convex Optimization in Adversarial Markov Decision Processes, 2019, ICML.

[24] Haipeng Luo, et al. More Adaptive Algorithms for Adversarial Bandits, 2018, COLT.

[25] Éva Tardos, et al. Learning in Games: Robustness of Fast Convergence, 2016, NIPS.

[26] Gergely Neu, et al. Online learning in episodic Markovian decision processes by relative entropy policy search, 2013, NIPS.

[27] Sham M. Kakade, et al. Towards Minimax Policies for Online Linear Optimization with Bandit Feedback, 2012, COLT.

[28] Joel A. Tropp, et al. User-Friendly Tail Bounds for Sums of Random Matrices, 2010, Found. Comput. Math.

[29] L. Meng, et al. The optimal perturbation bounds of the Moore–Penrose inverse under the Frobenius norm, 2010.

[30] Thomas P. Hayes, et al. The Price of Bandit Information for Online Optimization, 2007, NIPS.

[31] John Langford, et al. Approximately Optimal Approximate Reinforcement Learning, 2002, ICML.

[32] Tor Lattimore, et al. Return of the bias: Almost minimax optimal high probability bounds for adversarial linear bandits, 2022, COLT.

[33] Shinji Ito, et al. Parameter-Free Multi-Armed Bandit Algorithms with Hybrid Data-Dependent Regret Bounds, 2021, COLT.

[34] Peter Auer, et al. The Nonstochastic Multiarmed Bandit Problem, 2002, SIAM J. Comput.