Fighting Bandits with a New Kind of Smoothness

We provide a new analysis framework for the adversarial multi-armed bandit problem. Using the notion of convex smoothing, we define a novel family of algorithms with minimax optimal regret guarantees. First, we show that regularization via the Tsallis entropy, which includes EXP3 as a special case, matches the O(√(NT)) minimax regret with a smaller constant factor. Second, we show that a wide class of perturbation methods achieves near-optimal regret as low as O(√(NT log N)), provided the perturbation distribution has a bounded hazard function. For example, the Gumbel, Weibull, Fréchet, Pareto, and Gamma distributions all satisfy this key property and lead to near-optimal algorithms.
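To illustrate the perturbation approach concretely, below is a minimal sketch (not the paper's exact algorithm) of Follow the Perturbed Leader with Gumbel noise in the adversarial bandit setting; the learning-rate choice and all variable names are illustrative assumptions. The Gumbel case is convenient because, by the Gumbel-max trick, the arm-selection probabilities coincide exactly with the exponential-weights distribution, so the importance weights needed for unbiased loss estimates are available in closed form (this is the sense in which EXP3 arises as a special case).

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 10, 5000                          # arms, rounds (illustrative)
eta = np.sqrt(np.log(N) / (N * T))       # assumed learning rate of order sqrt(log N / NT)

losses = rng.uniform(size=(T, N))        # oblivious adversary: losses fixed in advance
L_hat = np.zeros(N)                      # cumulative importance-weighted loss estimates

total_loss = 0.0
for t in range(T):
    # Follow the perturbed leader: add i.i.d. Gumbel noise to the scaled
    # negative cumulative loss estimates and play the argmax arm.
    gumbel = rng.gumbel(size=N)
    arm = int(np.argmax(-eta * L_hat + gumbel))

    # Gumbel-max trick: the probability of playing each arm is exactly
    # the exponential-weights (softmax) distribution, so the importance
    # weight p[arm] is known in closed form rather than estimated.
    scores = -eta * L_hat
    p = np.exp(scores - scores.max())
    p /= p.sum()

    loss = losses[t, arm]
    total_loss += loss
    L_hat[arm] += loss / p[arm]          # unbiased estimate of the loss vector

regret = total_loss - losses.sum(axis=0).min()
```

For general hazard-bounded perturbations (Weibull, Fréchet, Pareto, Gamma) the selection probabilities have no closed form, and one would instead estimate the importance weights, e.g. by resampling the perturbation.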
