Analysis of Thompson Sampling for Stochastic Sleeping Bandits

We study a variant of the stochastic multi-armed bandit problem in which the set of available arms varies arbitrarily over time, also known as the sleeping bandit problem. We focus on the Thompson Sampling algorithm and consider a notion of regret defined with respect to the best arm available at each round. Our main result is an O(log T) regret bound for Thompson Sampling, generalizing a similar bound known for this algorithm in the classical bandit setting. The bound also matches, up to constant factors, the best-known lower bound for the sleeping bandit problem. Finally, we show via simulations that Thompson Sampling outperforms the UCB-style AUER algorithm on sleeping bandit instances.
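To make the setting concrete, below is a minimal sketch of Thompson Sampling restricted to the available arms, assuming Bernoulli rewards and Beta(1, 1) priors. The availability process, the arm means, and all names (e.g. `thompson_sampling_sleeping`) are illustrative assumptions, not the paper's experimental setup; note that regret is charged against the best *available* arm each round, as in the regret notion above.

```python
import numpy as np

def thompson_sampling_sleeping(available_arms, successes, failures, rng):
    """Pick an arm from the currently available set via Thompson Sampling.

    For each available arm, draw a sample from its Beta posterior
    (Bernoulli rewards, Beta(1, 1) prior) and play the arm with the
    largest sampled value.
    """
    samples = {a: rng.beta(successes[a] + 1, failures[a] + 1)
               for a in available_arms}
    return max(samples, key=samples.get)

# Toy simulation: K Bernoulli arms, an arbitrary nonempty subset
# of which is available in each round (here drawn at random for
# illustration; the model allows arbitrary availability sequences).
rng = np.random.default_rng(0)
K, T = 5, 10_000
means = rng.uniform(0.1, 0.9, size=K)   # unknown arm means (assumed)
successes = np.zeros(K)
failures = np.zeros(K)

regret = 0.0
for t in range(T):
    available = [a for a in range(K) if rng.random() < 0.7]
    if not available:
        available = [int(rng.integers(K))]
    arm = thompson_sampling_sleeping(available, successes, failures, rng)
    reward = rng.random() < means[arm]
    successes[arm] += reward
    failures[arm] += 1 - reward
    # Regret relative to the best arm available this round.
    regret += means[max(available, key=lambda a: means[a])] - means[arm]

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```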
