Perturbed-History Exploration in Stochastic Multi-Armed Bandits

We propose an online algorithm for cumulative regret minimization in a stochastic multi-armed bandit. In round t, the algorithm adds O(t) i.i.d. pseudo-rewards to its history and then pulls the arm with the highest average reward in this perturbed history; we therefore call it perturbed-history exploration (PHE). The pseudo-rewards are designed so that, with high probability, they offset potentially underestimated mean rewards of the arms. We derive near-optimal gap-dependent and gap-free bounds on the n-round regret of PHE. The key step in our analysis is a novel argument showing that randomized Bernoulli rewards lead to optimism. Finally, we evaluate PHE empirically and show that it is competitive with state-of-the-art baselines.
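The selection rule above is simple enough to sketch directly. Below is a minimal Python implementation for Bernoulli rewards; the Bernoulli(1/2) pseudo-rewards and the perturbation scale `a` (an arm pulled s times receives roughly a*s pseudo-rewards, so the total added per round is O(t)) are illustrative assumptions consistent with the "randomized Bernoulli rewards" in the analysis, not a definitive account of the paper's constants.

```python
import numpy as np

class PHE:
    """Perturbed-history exploration (PHE) for a Bernoulli bandit.

    Minimal sketch: an arm pulled s times has its reward history
    augmented with ceil(a * s) i.i.d. Bernoulli(1/2) pseudo-rewards
    (an assumed design), and the arm with the highest average reward
    in the perturbed history is pulled.
    """

    def __init__(self, num_arms, a=1.1, seed=0):
        self.a = a                             # perturbation scale (assumed value)
        self.pulls = np.zeros(num_arms)        # number of pulls of each arm
        self.reward_sums = np.zeros(num_arms)  # sum of observed rewards per arm
        self.rng = np.random.default_rng(seed)

    def select_arm(self):
        # Initialization: pull each arm once before perturbing.
        untried = np.flatnonzero(self.pulls == 0)
        if untried.size > 0:
            return int(untried[0])
        # ceil(a * s) pseudo-rewards per arm; summed over arms this is O(t).
        num_pseudo = np.ceil(self.a * self.pulls).astype(int)
        # The sum of n i.i.d. Bernoulli(1/2) pseudo-rewards is Binomial(n, 1/2).
        pseudo_sums = self.rng.binomial(num_pseudo, 0.5)
        perturbed_means = (self.reward_sums + pseudo_sums) / (self.pulls + num_pseudo)
        return int(np.argmax(perturbed_means))

    def update(self, arm, reward):
        self.pulls[arm] += 1
        self.reward_sums[arm] += reward


# Usage on a hypothetical 3-armed Bernoulli bandit.
means = [0.7, 0.5, 0.4]
agent = PHE(num_arms=3, a=1.1)
for t in range(10000):
    arm = agent.select_arm()
    agent.update(arm, np.random.binomial(1, means[arm]))
```

Resampling the pseudo-rewards in every round is what drives exploration: an arm with a short history gets a noisier perturbed mean, which occasionally lifts an underestimated arm above the current leader.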
