Scale-Free Adversarial Multi-Armed Bandits

We consider the Scale-Free Adversarial Multi-Armed Bandit (MAB) problem, where the player knows only the number of arms n and not the scale or magnitude of the losses. The player observes bandit feedback about the loss vectors ℓ_1, …, ℓ_T ∈ ℝ^n. The goal is to bound its regret as a function of n and ℓ_1, …, ℓ_T. We design a Follow The Regularized Leader (FTRL) algorithm that comes with the first scale-free regret guarantee for MAB. It uses the log-barrier regularizer, the importance-weighted estimator, an adaptive learning rate, and an adaptive exploration parameter. In the analysis, we introduce a simple, unifying technique for obtaining regret inequalities for FTRL and Online Mirror Descent (OMD) on the probability simplex using potential functions and mixed Bregman divergences. We also develop a new technique for obtaining local-norm lower bounds for Bregman divergences, which are crucial in bandit regret bounds. These tools could be of independent interest.
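To make the algorithmic ingredients concrete, here is a minimal Python sketch of FTRL with the log-barrier regularizer over the simplex, uniform-exploration mixing, and the importance-weighted estimator. It is an illustration under assumptions, not the paper's algorithm: the constant learning rate eta, the constant exploration parameter gamma, the loss_fn interface, and the helper solve_ftrl are hypothetical placeholders standing in for the paper's adaptive schedules.

```python
import numpy as np

def log_barrier_ftrl(loss_fn, n, T, seed=0):
    """Sketch of log-barrier FTRL for adversarial bandits.

    loss_fn(t, arm) is a hypothetical callback returning the observed loss
    of the pulled arm at round t. eta and gamma are fixed placeholders,
    not the adaptive learning-rate / exploration choices of the paper.
    """
    rng = np.random.default_rng(seed)
    cum_loss_est = np.zeros(n)     # running sum of importance-weighted loss estimates
    eta, gamma = 0.1, 0.01         # placeholder learning rate and exploration parameter
    total_loss = 0.0
    for t in range(T):
        p = solve_ftrl(cum_loss_est, eta, n)   # FTRL step with log-barrier regularizer
        q = (1.0 - gamma) * p + gamma / n      # mix in uniform exploration
        arm = rng.choice(n, p=q)
        loss = loss_fn(t, arm)                 # only the pulled arm's loss is observed
        cum_loss_est[arm] += loss / q[arm]     # importance-weighted estimator
        total_loss += loss
    return total_loss

def solve_ftrl(L, eta, n, iters=200):
    """Minimize <L, p> + (1/eta) * sum_i(-log p_i) over the probability simplex.

    Stationarity gives p_i = 1 / (eta * (L_i + mu)); binary-search the
    normalization variable mu so that the entries sum to one.
    """
    lo, hi = -np.min(L), n / eta - np.min(L)
    for _ in range(iters):
        mu = 0.5 * (lo + hi)
        p = 1.0 / (eta * np.maximum(L + mu, 1e-12))
        if p.sum() > 1.0:
            lo = mu        # probabilities too large: increase mu
        else:
            hi = mu
    return p / p.sum()
```

As a usage sketch, log_barrier_ftrl(lambda t, i: np.sin(t + i), n=5, T=1000) simulates a run against an arbitrary (here synthetic) loss sequence; note that because the losses are never rescaled, any scale-freeness would have to come from the adaptive parameter schedules that this placeholder omits.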
