Bandits With Heavy Tail

The stochastic multiarmed bandit problem is well understood when the reward distributions are sub-Gaussian. In this paper, we examine the bandit problem under the weaker assumption that the distributions have moments of order 1 + ε, for some ε ∈ (0,1]. Surprisingly, moments of order 2 (i.e., finite variance) are sufficient to obtain regret bounds of the same order as under sub-Gaussian reward distributions. To achieve such regret, we define sampling strategies based on refined estimators of the mean such as the truncated empirical mean, Catoni's M-estimator, and the median-of-means estimator. We also derive matching lower bounds, which show that the best achievable regret deteriorates when ε < 1.
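As an illustration of one of the estimators named above, the following is a minimal sketch of a median-of-means estimate. It is not taken from the paper: the function name `median_of_means`, the parameter `num_blocks`, and the choice to discard leftover samples are assumptions made for this example, and the number of blocks that a bandit strategy would use (typically tied to the desired confidence level) is not specified here.

```python
import statistics


def median_of_means(samples, num_blocks):
    """Median of the means of `num_blocks` equal-size consecutive blocks.

    Hypothetical helper for illustration only. Samples left over after
    forming equal-size blocks are discarded.
    """
    if num_blocks < 1 or num_blocks > len(samples):
        raise ValueError("num_blocks must be between 1 and len(samples)")
    block_size = len(samples) // num_blocks
    # Average each consecutive block of block_size samples.
    block_means = [
        statistics.fmean(samples[i * block_size:(i + 1) * block_size])
        for i in range(num_blocks)
    ]
    # The median is insensitive to a minority of blocks hit by extreme values.
    return statistics.median(block_means)


# One extreme reward corrupts a single block mean but not the median.
rewards = [1.0, 1.0, 1.0, 1.0, 1.0, 1000.0]
print(median_of_means(rewards, 3))  # block means are 1.0, 1.0, 500.5
print(statistics.fmean(rewards))    # the plain empirical mean, for comparison
```

The point of the sketch is the contrast with the plain empirical mean: a single very large observation shifts the empirical mean substantially, whereas it affects only one block mean and therefore leaves the median of the block means unchanged.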
