The Best of Both Worlds: Stochastic and Adversarial Bandits

25th Annual Conference on Learning Theory

We present a new bandit algorithm, SAO (Stochastic and Adversarial Optimal), whose regret is (essentially) optimal both for adversarial rewards and for stochastic rewards. Specifically, SAO combines the O(√n) worst-case regret of Exp3 (Auer et al., 2002b) with the (poly)logarithmic regret of UCB1 (Auer et al., 2002a) for stochastic rewards. Adversarial rewards and stochastic rewards are the two main settings in the literature on multi-armed bandits (MAB); prior work treats them separately and does not attempt to jointly optimize for both. This result falls within the general agenda of designing algorithms that combine optimal worst-case performance with improved guarantees for "nice" problem instances.
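For readers less familiar with the two baselines named above, the sketch below is a minimal Python rendering of UCB1 (the index policy with logarithmic regret for stochastic rewards) and Exp3 (the exponential-weights policy with square-root worst-case regret). It is illustrative only and is not the SAO algorithm of the paper; the reward oracle `pull(arm)`, the arm count, horizon, and mixing parameter `gamma` are assumptions made for the example.

```python
import math
import random


def ucb1(pull, n_arms, horizon):
    """UCB1 (Auer et al., 2002a): index policy for stochastic rewards in [0, 1]."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    rewards = []
    for t in range(horizon):
        if t < n_arms:
            arm = t  # play each arm once to initialize the estimates
        else:
            # index = empirical mean + confidence radius
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t + 1) / counts[a]),
            )
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
        rewards.append(r)
    return rewards


def exp3(pull, n_arms, horizon, gamma=0.1):
    """Exp3 (Auer et al., 2002b): exponential weights with importance-weighted
    reward estimates, for adversarial rewards in [0, 1]."""
    weights = [1.0] * n_arms
    rewards = []
    for _ in range(horizon):
        total = sum(weights)
        probs = [(1 - gamma) * w / total + gamma / n_arms for w in weights]
        arm = random.choices(range(n_arms), weights=probs)[0]
        r = pull(arm)
        est = r / probs[arm]  # unbiased estimate of the chosen arm's reward
        weights[arm] *= math.exp(gamma * est / n_arms)
        # rescale so the weights cannot overflow; the probabilities are unchanged
        m = max(weights)
        weights = [w / m for w in weights]
        rewards.append(r)
    return rewards


if __name__ == "__main__":
    # A small stochastic instance: two Bernoulli arms with means 0.5 and 0.6.
    means = [0.5, 0.6]
    pull = lambda a: 1.0 if random.random() < means[a] else 0.0
    print("UCB1 total reward:", sum(ucb1(pull, 2, 10_000)))
    print("Exp3 total reward:", sum(exp3(pull, 2, 10_000)))
```

SAO's contribution is to obtain both guarantees simultaneously in a single algorithm; the sketch only shows the two ingredients its regret bounds are compared against.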

[1] W. Hoeffding. Probability Inequalities for Sums of Bounded Random Variables, 1963.

[2] D. Freedman. On Tail Probabilities for Martingales, 1975.

[3] Peter Auer et al. The Nonstochastic Multiarmed Bandit Problem, 2002, SIAM J. Comput.

[4] Peter Auer et al. Finite-time Analysis of the Multiarmed Bandit Problem, 2002, Machine Learning.

[5] Gilles Stoltz. Incomplete information and internal regret in prediction of individual sequences, 2005.

[6] Gábor Lugosi et al. Prediction, Learning, and Games, 2006.

[7] Yishay Mansour et al. Improved second-order bounds for prediction with expert advice, 2006, Machine Learning.

[8] H. Robbins. Some aspects of the sequential design of experiments, 1952.

[9] Thomas P. Hayes et al. Stochastic Linear Optimization under Bandit Feedback, 2008, COLT.

[10] Eli Upfal et al. Multi-Armed Bandits in Metric Spaces, 2008.

[11] Elad Hazan et al. Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization, 2008, COLT.

[12] Csaba Szepesvári et al. Online Optimization in X-Armed Bandits, 2008, NIPS.

[13] Csaba Szepesvári et al. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits, 2009, Theor. Comput. Sci.

[14] Nikhil R. Devanur et al. The price of truthfulness for pay-per-click auctions, 2009, EC '09.

[15] Moshe Babaioff et al. Truthful mechanisms with implicit payment computation, 2010, EC '10.

[16] Akimichi Takemura et al. An Asymptotically Optimal Bandit Algorithm for Bounded Support Models, 2010, COLT.

[17] Peter Auer et al. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem, 2010, Period. Math. Hung.

[18] Jean-Yves Audibert et al. Regret Bounds and Minimax Policies under Partial Monitoring, 2010, J. Mach. Learn. Res.

[19] Aleksandrs Slivkins et al. Sharp dichotomies for regret minimization in metric spaces, 2010, SODA '10.

[20] Sébastien Bubeck. Bandits Games and Clustering Foundations, 2010.

[21] Elad Hazan et al. Better Algorithms for Benign Bandits, 2009, J. Mach. Learn. Res.

[22] Rémi Munos et al. Adaptive Bandits: Towards the best history-dependent strategy, 2011, AISTATS.

[23] Vianney Perchet et al. The multi-armed bandit problem with covariates, 2011, arXiv.

[24] Aleksandrs Slivkins et al. Contextual Bandits with Similarity Information, 2009, COLT.

[25] Aurélien Garivier et al. The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond, 2011, COLT.

[26] Moshe Babaioff et al. Characterizing truthful multi-armed bandit mechanisms: extended abstract, 2009, EC '09.

[27] T. L. Lai and Herbert Robbins. Asymptotically Efficient Adaptive Allocation Rules, 1985, Advances in Applied Mathematics.