Open Problem: Regret Bounds for Thompson Sampling

Contextual multi-armed bandits (Langford and Zhang, 2008) have received substantial interest in recent years due to their wide applications on the Internet, such as news recommendation and advertising. The fundamental challenge here is to balance exploration and exploitation so that the total payoff collected by an algorithm approaches that of an optimal strategy. Exploration techniques like ε-greedy, UCB (upper confidence bound), and their many variants have been extensively studied. Interestingly, one of the oldest exploration heuristics, dating back to Thompson (1933), was not popular in the literature until recently, when researchers began to recognize its effectiveness in critical real-world applications (Scott, 2010; Graepel et al., 2010; May and Leslie, 2011; Chapelle and Li, 2012). This heuristic, known as Thompson sampling, follows the principle of “probability matching,” which states that an arm is chosen with the probability that it is the optimal one. A generic description is given in Algorithm 1, where the algorithm maintains a posterior distribution P(θ | D) over a parameter space Θ that defines a set of greedy policies. At every step, a random model θt is drawn from the posterior, and the greedy action according to the payoff predictions of θt is chosen.
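As a concrete illustration of this probability-matching scheme, the sketch below instantiates it for the simplest (non-contextual) case of a K-armed Bernoulli bandit with independent Beta(1, 1) priors, where the posterior P(θ | D) factors into a Beta distribution per arm. This is only a minimal sketch under those assumptions, not the general Algorithm 1; the function pull_arm and all other names are hypothetical.

```python
import numpy as np

def thompson_sampling_bernoulli(pull_arm, n_arms, n_steps, rng=None):
    """Thompson sampling for a K-armed Bernoulli bandit with Beta(1, 1) priors.

    `pull_arm(a)` is a hypothetical environment callback returning a 0/1
    reward for arm `a`.
    """
    rng = np.random.default_rng() if rng is None else rng
    alpha = np.ones(n_arms)  # Beta posterior parameters (successes + 1)
    beta = np.ones(n_arms)   # Beta posterior parameters (failures + 1)
    total_reward = 0.0
    for _ in range(n_steps):
        # Draw a random model theta_t from the posterior P(theta | D):
        # one sampled mean payoff per arm.
        theta = rng.beta(alpha, beta)
        # Act greedily with respect to the sampled model, so each arm is
        # chosen with the posterior probability that it is optimal.
        arm = int(np.argmax(theta))
        reward = pull_arm(arm)
        total_reward += reward
        # Conjugate posterior update for the chosen arm only.
        alpha[arm] += reward
        beta[arm] += 1 - reward
    return total_reward
```

Because the Beta prior is conjugate to the Bernoulli likelihood, the posterior update reduces to incrementing a success or failure count, which is what makes this special case a convenient testbed.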