Old Dog Learns New Tricks: Randomized UCB for Bandit Problems

We propose $\tt RandUCB$, a bandit strategy that uses theoretically derived confidence intervals similar to upper confidence bound (UCB) algorithms but, akin to Thompson sampling (TS), uses randomization to trade off exploration and exploitation. In the $K$-armed bandit setting, we show that there are infinitely many variants of $\tt RandUCB$, all of which achieve the minimax-optimal $\widetilde{O}(\sqrt{K T})$ regret after $T$ rounds. Moreover, in a specific multi-armed bandit setting, we show that both UCB and TS can be recovered as special cases of $\tt RandUCB$. For structured bandits, where each arm is associated with a $d$-dimensional feature vector and rewards are distributed according to a linear or generalized linear model, we prove that $\tt RandUCB$ achieves the minimax-optimal $\widetilde{O}(d \sqrt{T})$ regret even when the number of arms is infinite. We demonstrate the practical effectiveness of $\tt RandUCB$ with experiments in both the multi-armed and structured bandit settings. Our results illustrate that $\tt RandUCB$ matches the empirical performance of TS while obtaining the theoretically optimal regret bounds of UCB algorithms, thus achieving the best of both worlds.
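To make the idea concrete, the following is a minimal sketch of a $K$-armed RandUCB-style strategy as suggested by the abstract: each arm's UCB-style confidence width is scaled by a random multiplier drawn once per round from a fixed discrete distribution on an interval, so the index interpolates between a greedy rule (multiplier near zero) and a standard UCB rule (multiplier at its maximum). The sampling distribution, the Hoeffding-style confidence width, and all parameter names below are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np


class RandUCBSketch:
    """Illustrative K-armed RandUCB-style strategy (a sketch, not the paper's exact algorithm).

    Index of arm i at round t:  mean_i + Z_t * width_i,
    where Z_t is drawn fresh each round from a discrete distribution
    supported on [0, beta], and width_i is a UCB-style confidence width.
    """

    def __init__(self, n_arms, beta=2.0, n_points=20, rng=None):
        self.n_arms = n_arms
        self.rng = rng or np.random.default_rng()
        # Discrete support and (here, uniform) sampling weights -- assumed choices.
        self.support = np.linspace(0.0, beta, n_points)
        self.probs = np.full(n_points, 1.0 / n_points)
        self.counts = np.zeros(n_arms)
        self.sums = np.zeros(n_arms)
        self.t = 0

    def select_arm(self):
        self.t += 1
        # Play each arm once before using the randomized index.
        untried = np.where(self.counts == 0)[0]
        if untried.size > 0:
            return int(untried[0])
        means = self.sums / self.counts
        # Hoeffding-style confidence width (assumed form).
        widths = np.sqrt(2.0 * np.log(self.t) / self.counts)
        # One multiplier per round, shared across all arms.
        z = self.rng.choice(self.support, p=self.probs)
        return int(np.argmax(means + z * widths))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward
```

In this sketch, concentrating the multiplier distribution at its maximum recovers a deterministic UCB index, while concentrating it near zero yields nearly greedy play; varying the distribution over its support is the sense in which the abstract's family of variants trades off exploration and exploitation through randomization.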
