Improved High-Probability Regret for Adversarial Bandits with Time-Varying Feedback Graphs

We study high-probability regret bounds for adversarial $K$-armed bandits with time-varying feedback graphs over $T$ rounds. For general strongly observable graphs, we develop an algorithm that achieves the optimal regret $\widetilde{\mathcal{O}}((\sum_{t=1}^T\alpha_t)^{1/2}+\max_{t\in[T]}\alpha_t)$ with high probability, where $\alpha_t$ is the independence number of the feedback graph at round $t$. Compared to the best existing result [Neu, 2015], which only considers graphs with self-loops for all nodes, our result not only holds more generally, but, importantly, also removes any $\text{poly}(K)$ dependence, which can be prohibitively large for applications such as contextual bandits. Furthermore, we develop the first algorithm that achieves the optimal high-probability regret bound for weakly observable graphs, which even improves the best expected regret bound of [Alon et al., 2015] by removing the $\mathcal{O}(\sqrt{KT})$ term with a refined analysis. Our algorithms are based on the online mirror descent framework, but importantly with an innovative combination of several techniques. Notably, while earlier works use optimistically biased loss estimators to achieve high-probability bounds, we find it important to use a pessimistically biased one for nodes without self-loops in a strongly observable graph.
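To make the estimator discussion concrete, the following is a minimal sketch of the standard machinery this line of work builds on, not the paper's exact construction. The notation is introduced only for this illustration: $p_t \in \Delta_K$ is the sampling distribution, $N_t^{\mathrm{in}}(i)$ is the set of arms whose selection reveals the loss of arm $i$ under the round-$t$ graph, $W_t(i)$ is the probability of observing that loss, $\gamma \ge 0$ is an implicit-exploration parameter in the spirit of [Neu, 2015], $\eta$ is a learning rate, and $D_\psi$ is the Bregman divergence of the mirror-descent regularizer:

$$\widehat{\ell}_t(i) \;=\; \frac{\ell_t(i)\,\mathbb{1}\{I_t \in N_t^{\mathrm{in}}(i)\}}{W_t(i) + \gamma}, \qquad W_t(i) \;=\; \sum_{j \in N_t^{\mathrm{in}}(i)} p_t(j), \qquad p_{t+1} \;=\; \operatorname*{argmin}_{p \in \Delta_K}\; \big\langle p, \widehat{\ell}_t \big\rangle + \tfrac{1}{\eta} D_\psi(p, p_t).$$

With $\gamma > 0$ the estimator is optimistically biased (it tends to underestimate losses); the observation highlighted in the abstract is that for nodes without self-loops in a strongly observable graph, a bias in the opposite, pessimistic direction is what makes the high-probability analysis go through.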

[1] M. Mohri et al. Stochastic Online Learning with Feedback Graphs: Finite-Time and Asymptotic Optimality, 2022, NeurIPS.

[2] J. Honda et al. Nearly Optimal Best-of-Both-Worlds Algorithms for Online Learning with Feedback Graphs, 2022, NeurIPS.

[3] Chihao Zhang et al. Understanding Bandits with Graph Feedback, 2021, NeurIPS.

[4] Csaba Szepesvari et al. Bandit Algorithms, 2020.

[5] Haipeng Luo et al. Bias no more: high-probability data-dependent regret bounds for adversarial bandits and MDPs, 2020, NeurIPS.

[6] Haipeng Luo et al. A Closer Look at Small-loss Bounds for Bandits with Graph Feedback, 2020, COLT.

[7] Éva Tardos et al. Small-loss bounds for online learning with partial information, 2017, COLT.

[8] Fang Liu et al. Reward Maximization Under Uncertainty: Leveraging Side-Observations on Networks, 2017, J. Mach. Learn. Res.

[9] Tomer Koren et al. Online Learning with Feedback Graphs Without the Graphs, 2016, ICML.

[10] Gergely Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits, 2015, NIPS.

[11] N. Alon et al. Online Learning with Feedback Graphs: Beyond Bandits, 2015, COLT.

[12] Rémi Munos et al. Efficient learning by implicit exploration in bandit problems with side observations, 2014, NIPS.

[13] Noga Alon et al. Nonstochastic Multi-Armed Bandits with Graph-Structured Feedback, 2014, SIAM J. Comput.

[14] Marc Lelarge et al. Leveraging Side Observations in Stochastic Bandits, 2012, UAI.

[15] Shie Mannor et al. From Bandits to Experts: On the Value of Side-Observations, 2011, NIPS.

[16] John Langford et al. Contextual Bandit Algorithms with Supervised Learning Guarantees, 2010, AISTATS.

[17] Jacob D. Abernethy et al. Beating the adaptive bandit with high probability, 2009, Information Theory and Applications Workshop.

[18] Thomas P. Hayes et al. High-Probability Regret Bounds for Bandit Online Linear Optimization, 2008, COLT.

[19] J. Langford et al. The Epoch-Greedy algorithm for contextual multi-armed bandits, 2007, NIPS.

[20] Yoav Freund et al. A decision-theoretic generalization of on-line learning and an application to boosting, 1997, EuroCOLT.

[21] Tor Lattimore et al. Return of the bias: Almost minimax optimal high probability bounds for adversarial linear bandits, 2022, COLT.

[22] Tomer Koren et al. Towards Best-of-All-Worlds Online Learning with Feedback Graphs, 2021, NeurIPS.

[23] Peter Auer et al. The Nonstochastic Multiarmed Bandit Problem, 2002, SIAM J. Comput.