Sliding-Window Thompson Sampling for Non-Stationary Settings

Multi-Armed Bandit (MAB) techniques have been successfully applied to many classes of sequential decision problems over the past decades. However, non-stationary settings, which are very common in real-world applications, have received little attention so far, and theoretical guarantees on the regret are known only for some frequentist algorithms. In this paper, we propose an algorithm, namely Sliding-Window Thompson Sampling (SW-TS), for non-stationary stochastic MAB settings. Our algorithm is based on Thompson Sampling and exploits a sliding-window approach to tackle, in a unified fashion, two different forms of non-stationarity studied separately so far: abruptly changing and smoothly changing environments. In the former, the reward distributions are constant during sequences of rounds, and they may change arbitrarily at unknown rounds; in the latter, the reward distributions evolve smoothly over rounds according to unknown dynamics. Under mild assumptions, we provide upper bounds on the dynamic pseudo-regret of SW-TS for the abruptly changing environment, for the smoothly changing one, and for the setting in which both forms of non-stationarity are present. Furthermore, we empirically show that SW-TS dramatically outperforms state-of-the-art algorithms even when the two forms of non-stationarity are considered separately, as previously studied in the literature.
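The abstract gives no pseudocode, so the following is a minimal illustrative sketch of the sliding-window idea, under the common assumptions of Bernoulli rewards and Beta posteriors: the posterior for each arm is formed from only the last `window` observations, so stale evidence is forgotten and the policy can track changing reward distributions. All names (`SlidingWindowTS`, `window`) are hypothetical, not taken from the paper.

```python
import random
from collections import deque

class SlidingWindowTS:
    """Illustrative sliding-window Thompson Sampling for Bernoulli bandits.

    Beta posteriors are rebuilt from only the last `window` (arm, reward)
    pairs; observations older than the window are discarded, which lets
    the policy adapt to abrupt or smooth changes in the reward means.
    """

    def __init__(self, n_arms: int, window: int):
        self.n_arms = n_arms
        # A full deque evicts the oldest pair on append, implementing the window.
        self.history = deque(maxlen=window)

    def select_arm(self) -> int:
        # Beta(1 + successes, 1 + failures) per arm, over windowed data only.
        stats = [[1, 1] for _ in range(self.n_arms)]  # [alpha, beta] priors
        for arm, reward in self.history:
            stats[arm][0] += reward
            stats[arm][1] += 1 - reward
        # Thompson step: sample one value per posterior, play the argmax.
        samples = [random.betavariate(a, b) for a, b in stats]
        return max(range(self.n_arms), key=samples.__getitem__)

    def update(self, arm: int, reward: int) -> None:
        self.history.append((arm, reward))


# Example: two Bernoulli arms whose means swap halfway (an abrupt change).
means = [0.8, 0.2]
policy = SlidingWindowTS(n_arms=2, window=200)
for t in range(2000):
    if t == 1000:
        means = [0.2, 0.8]  # unknown breakpoint from the policy's viewpoint
    arm = policy.select_arm()
    reward = 1 if random.random() < means[arm] else 0
    policy.update(arm, reward)
```

The window size trades off adaptivity against statistical efficiency: a short window reacts quickly after a change but uses fewer samples per posterior, while a long window approaches standard Thompson Sampling and lags behind changes.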
