Beyond the Hazard Rate: More Perturbation Algorithms for Adversarial Multi-armed Bandits

Recent work on Follow the Perturbed Leader (FTPL) algorithms for the adversarial multi-armed bandit problem has highlighted the role of the hazard rate of the distribution generating the perturbations. Assuming that the hazard rate is bounded, it is possible to provide regret analyses for a variety of FTPL algorithms for the multi-armed bandit problem. This paper pushes the inquiry into regret bounds for FTPL algorithms beyond the bounded-hazard-rate condition. There are good reasons to do so: natural distributions, such as the uniform and the Gaussian, violate the condition. We give regret bounds for both bounded-support and unbounded-support distributions without assuming the hazard rate condition. We also disprove a conjecture that the Gaussian distribution cannot lead to a low-regret algorithm; in fact, it turns out to yield near-optimal regret, up to logarithmic factors. A key ingredient in our approach is the introduction of a new notion that we call the generalized hazard rate.
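
For intuition, the hazard rate of a distribution with density f and CDF F is h(x) = f(x) / (1 - F(x)), and the bounded-hazard-rate condition requires sup_x h(x) to be finite. Both distributions named above fail it: for the uniform on [0, 1], h(x) = 1/(1 - x) blows up near 1, and the Gaussian hazard rate grows without bound (asymptotically like x, by standard Mills' ratio estimates).

Below is a minimal, hedged sketch of an FTPL bandit algorithm with Gaussian perturbations, the case the abstract singles out. The function name and the defaults for the learning rate eta and the truncation level max_resample are illustrative choices, not values from the paper; the loss-estimation step uses the geometric resampling device of Neu and Bartók (ALT 2013), one standard way to obtain importance-weighted estimates when the arm probabilities of FTPL have no closed form.

```python
import numpy as np


def ftpl_gaussian_bandit(loss_matrix, eta=0.1, max_resample=1000, seed=0):
    """Sketch of FTPL with Gaussian perturbations for the adversarial
    K-armed bandit (hypothetical parameter defaults, not the paper's tuning).

    loss_matrix: (T, K) array of losses in [0, 1] chosen by the adversary.
    """
    rng = np.random.default_rng(seed)
    T, K = loss_matrix.shape
    L_hat = np.zeros(K)  # cumulative importance-weighted loss estimates
    total_loss = 0.0
    for t in range(T):
        # Play the perturbed leader: minimize the estimated cumulative
        # loss minus a fresh Gaussian perturbation (scaled via eta).
        z = rng.standard_normal(K)
        arm = int(np.argmin(eta * L_hat - z))
        loss = float(loss_matrix[t, arm])
        total_loss += loss
        # Geometric resampling: redraw perturbations until the same arm
        # would be played again; the number of redraws m is a (truncated,
        # hence slightly biased) estimate of 1 / P(arm is played).
        m = max_resample
        for k in range(1, max_resample + 1):
            z = rng.standard_normal(K)
            if int(np.argmin(eta * L_hat - z)) == arm:
                m = k
                break
        # Importance-weighted loss estimate, credited to the played arm only.
        L_hat[arm] += m * loss
    return total_loss


if __name__ == "__main__":
    # Toy run: 4 arms over 5000 rounds, with arm 0 better on average.
    rng = np.random.default_rng(1)
    losses = rng.uniform(size=(5000, 4))
    losses[:, 0] *= 0.5
    print("cumulative loss:", ftpl_gaussian_bandit(losses))
```

The choice of standard normal perturbations here is deliberate: it is exactly the distribution the abstract says was conjectured to fail, yet achieves near-optimal regret up to logarithmic factors.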
