The non-stationary stochastic multi-armed bandit problem

We consider a variant of the stochastic multi-armed bandit with K arms in which the rewards are not assumed to be identically distributed but are generated by a non-stationary stochastic process. We first study the unique best arm setting, in which a single arm remains the best throughout the run. We then study the general switching best arm setting, in which the best arm changes at unknown time steps. For both settings, we target problem-dependent bounds rather than the more conservative problem-free bounds. We consider two classical problems: (1) identifying a best arm with high probability (best arm identification), for which performance is measured by the sample complexity, i.e., the number of samples needed before finding a near-optimal arm; to this end, we naturally extend the definition of sample complexity so that it remains meaningful in the switching best arm setting, which may be of independent interest; and (2) achieving the smallest cumulative regret (regret minimization), where the regret is measured with respect to the strategy that pulls an arm with the best instantaneous mean at each step.
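As a minimal formal sketch of the latter performance measure, in our own notation rather than the paper's: writing \mu_k(t) for the mean reward of arm k at step t and I_t for the arm pulled at step t, the cumulative regret against the best-instantaneous-mean strategy over a horizon T can be written as

\[
R_T \;=\; \mathbb{E}\!\left[\sum_{t=1}^{T} \left( \max_{1 \le k \le K} \mu_k(t) \;-\; \mu_{I_t}(t) \right)\right],
\]

which reduces to the usual stationary regret when the means \mu_k(t) do not depend on t.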
