Sub-sampling for Efficient Non-Parametric Bandit Exploration

In this paper we propose the first multi-armed bandit algorithm based on re-sampling that achieves asymptotically optimal regret simultaneously for different families of arms (namely Bernoulli, Gaussian and Poisson distributions). Unlike Thompson Sampling, which requires specifying a different prior to be optimal in each case, our proposal RB-SDA does not need any distribution-dependent tuning. RB-SDA belongs to the family of Sub-sampling Duelling Algorithms (SDA), which combines the sub-sampling idea first used by the BESA [18] and SSMC [24] algorithms with different sub-sampling schemes. In particular, RB-SDA uses Random Block sampling. We perform an experimental study assessing the flexibility and robustness of this promising new approach to exploration in bandit models.
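To make the duelling mechanism concrete, the following Python sketch implements one round of a sub-sampling duel with Random Block sampling. It is a minimal illustration under stated assumptions, not the paper's reference implementation: the function names (`random_block`, `sda_round`), the tie-breaking rule, and the duel criterion (challenger wins if its empirical mean is at least the mean of an equally sized random block of the leader's history) are simplifications, and details such as forced exploration are omitted.

```python
import numpy as np

def random_block(history, m, rng):
    """Return a contiguous block of m rewards starting at a uniformly random index."""
    start = rng.integers(0, len(history) - m + 1)  # valid start positions: 0 .. len-m
    return history[start:start + m]

def sda_round(histories, rng):
    """One duelling round: return the indices of the arms to pull next.

    histories: list of 1-D numpy arrays, one reward history per arm.
    """
    counts = [len(h) for h in histories]
    # Leader: the arm with the most observations (ties broken by empirical mean).
    leader = max(range(len(histories)),
                 key=lambda k: (counts[k], histories[k].mean()))
    winners = []
    for k, hist in enumerate(histories):
        if k == leader:
            continue
        # Duel: compare the challenger's full-history mean with the mean of a
        # random block of the leader's history of the same size.
        block = random_block(histories[leader], len(hist), rng)
        if hist.mean() >= block.mean():
            winners.append(k)
    # If every challenger loses its duel, the leader is pulled instead.
    return winners or [leader]
```

As a usage example (with hypothetical Bernoulli arms), the round function can be iterated directly:

```python
rng = np.random.default_rng(0)
means = [0.4, 0.5, 0.6]  # illustrative arm means, not from the paper
histories = [rng.binomial(1, mu, size=1).astype(float) for mu in means]
for _ in range(1000):
    for k in sda_round(histories, rng):
        histories[k] = np.append(histories[k], rng.binomial(1, means[k]))
```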

[1]  W. R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples, 1933, Biometrika.

[2]  J. Halton et al. Algorithm 247: Radical-inverse quasi-random point sequence, 1964, CACM.

[3]  I. Sobol. On the distribution of points in a cube and the approximate evaluation of integrals, 1967.

[4]  S. T. Buckland et al. An Introduction to the Bootstrap, 1994.

[5]  R. Agrawal. Sample mean based index policies by O(log n) regret for the multi-armed bandit problem, 1995, Advances in Applied Probability.

[6]  A. Burnetas et al. Optimal Adaptive Policies for Sequential Allocation Problems, 1996.

[7]  R. F. Tichy et al. Sequences, Discrepancies and Applications, 1997.

[8]  J. A. Tawn et al. Modelling Dependence within Joint Tail Regions, 1997.

[9]  P. Auer et al. Finite-time Analysis of the Multiarmed Bandit Problem, 2002, Machine Learning.

[10]  K. Pak et al. Stirling Numbers of the Second Kind, 2005.

[11]  M. Kenward et al. An Introduction to the Bootstrap, 2007.

[12]  A. Garivier et al. The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond, 2011, COLT.

[13]  E. Kaufmann et al. Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis, 2012, ALT.

[14]  S. Agrawal et al. Analysis of Thompson Sampling for the Multi-armed Bandit Problem, 2012, COLT.

[15]  R. Munos et al. Kullback–Leibler upper confidence bounds for optimal sequential allocation, 2012, arXiv:1210.1136.

[16]  S. Agrawal et al. Further Optimal Regret Bounds for Thompson Sampling, 2013, AISTATS.

[17]  N. Korda et al. Thompson Sampling for 1-Dimensional Exponential Family Bandits, 2013, NIPS.

[18]  A. Baransi et al. Sub-sampling for Multi-armed Bandits, 2014, ECML/PKDD.

[19]  J. Honda and A. Takemura. Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards, 2015, J. Mach. Learn. Res.

[20]  I. Osband and B. Van Roy. Bootstrapped Thompson Sampling and Deep Exploration, 2015, ArXiv.

[21]  B. F. Hutton et al. What is the distribution of the number of unique original items in a bootstrap sample, 2016, arXiv:1602.05822.

[22]  B. Kveton et al. Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits, 2019, ICML.

[23]  C. Riou and J. Honda. Bandit Algorithms Based on Thompson Sampling for Bounded Reward Distributions, 2020, ALT.

[24]  H. Chan. The multi-armed bandit problem: An efficient nonparametric solution, 2020, Annals of Statistics.

[25]  Y. Yu et al. Residual Bootstrap Exploration for Bandit Algorithms, 2020, ArXiv.

[26]  T. Lattimore and C. Szepesvári. Bandit Algorithms, 2020, Cambridge University Press.

[27]  T. L. Lai and H. Robbins. Asymptotically Efficient Adaptive Allocation Rules, 1985, Advances in Applied Mathematics.