Multi-armed bandit problem with known trend

We consider a variant of the multi-armed bandit model, which we call the multi-armed bandit problem with known trend, in which the gambler knows the shape of each arm's reward function but not its distribution. This problem is motivated by several online applications, such as active learning, music recommendation, and interface recommendation, where each time an arm is sampled its received reward changes according to a known trend. By adapting the standard multi-armed bandit algorithm UCB1 to take advantage of this setting, we propose a new algorithm named Adjusted Upper Confidence Bound (A-UCB), which assumes a stochastic model. We provide upper bounds on the regret that compare favorably with those of UCB1, and we confirm these results experimentally through simulations.
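The abstract does not reproduce the algorithm itself, so the following is only a minimal Python sketch of one plausible reading of the idea: a UCB1-style index adjusted by a known per-arm trend. The model assumption (the reward of arm k on its n-th pull has mean mu_k * f_k(n) with mu_k unknown), the function names, and the exact form of the index are illustrative choices, not the paper's actual A-UCB rules.

```python
import math
import random

def a_ucb_sketch(trends, sample_reward, horizon):
    """Sketch of a trend-adjusted UCB policy (not the paper's exact A-UCB).

    Assumed model: the reward of arm k on its n-th pull has mean
    mu_k * trends[k](n), where mu_k is unknown and trends[k] is the
    known trend function of arm k.
    """
    K = len(trends)
    pulls = [0] * K      # number of times each arm has been played
    mu_hat = [0.0] * K   # running estimate of the base mean mu_k

    for t in range(1, horizon + 1):
        if t <= K:
            k = t - 1    # play each arm once to initialize estimates
        else:
            # Adjusted index: predicted next reward plus a UCB1-style
            # exploration bonus, both scaled by the known trend value.
            def index(k):
                f_next = trends[k](pulls[k] + 1)
                bonus = math.sqrt(2.0 * math.log(t) / pulls[k])
                return f_next * (mu_hat[k] + bonus)
            k = max(range(K), key=index)

        r = sample_reward(k, pulls[k] + 1)  # observe the reward
        pulls[k] += 1
        # De-trend the observation before averaging, so mu_hat tracks mu_k.
        x = r / max(trends[k](pulls[k]), 1e-12)
        mu_hat[k] += (x - mu_hat[k]) / pulls[k]

    return pulls, mu_hat

# Example usage: two arms with decaying trends and unknown base means.
trends = [lambda n: 1.0 / n, lambda n: 0.9 ** n]
mus = [0.8, 0.5]

def sample_reward(k, n):
    return mus[k] * trends[k](n) + random.gauss(0.0, 0.05)

pulls, est = a_ucb_sketch(trends, sample_reward, horizon=10_000)
```

The design choice here, de-trending each observation before averaging, is one simple way to keep a consistent estimate of the base mean when the expected reward drifts along a known curve; the paper's actual index and analysis may differ.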
