Open Problem: Regret Bounds for Thompson Sampling

Contextual multi-armed bandits (Langford and Zhang, 2008) have received substantial interest in recent years due to their wide applications on the Internet, such as news recommendation and advertising. The fundamental challenge here is to balance exploration and exploitation so that the total payoff collected by an algorithm approaches that of an optimal strategy. Exploration techniques like ε-greedy, UCB (upper confidence bound), and their many variants have been extensively studied. Interestingly, one of the oldest exploration heuristics, dating back to Thompson (1933), was not popular in the literature until recently, when researchers began to recognize its effectiveness in critical real-world applications (Scott, 2010; Graepel et al., 2010; May and Leslie, 2011; Chapelle and Li, 2012). This heuristic, known as Thompson sampling, follows the principle of “probability matching,” which states that an arm is chosen with the probability that it is the optimal one. A generic description is given in Algorithm 1, where the algorithm maintains a posterior distribution P(θ | D) over a parameter space Θ that defines a set of greedy policies. At every step, a random model θt is drawn from the posterior, and the greedy action according to the payoff predictions of θt is chosen.
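As a concrete illustration of this probability-matching scheme, the sketch below instantiates it for the simplest (non-contextual) case of a K-armed Bernoulli bandit with independent Beta(1, 1) priors, where the posterior P(θ | D) factors into a Beta distribution per arm. This is only a minimal sketch under those assumptions, not the general Algorithm 1; the function pull_arm and all other names are hypothetical.

```python
import numpy as np

def thompson_sampling_bernoulli(pull_arm, n_arms, n_steps, rng=None):
    """Thompson sampling for a K-armed Bernoulli bandit with Beta(1, 1) priors.

    `pull_arm(a)` is a hypothetical environment callback returning a 0/1
    reward for arm `a`.
    """
    rng = np.random.default_rng() if rng is None else rng
    alpha = np.ones(n_arms)  # Beta posterior parameters (successes + 1)
    beta = np.ones(n_arms)   # Beta posterior parameters (failures + 1)
    total_reward = 0.0
    for _ in range(n_steps):
        # Draw a random model theta_t from the posterior P(theta | D):
        # one sampled mean payoff per arm.
        theta = rng.beta(alpha, beta)
        # Act greedily with respect to the sampled model, so each arm is
        # chosen with the posterior probability that it is optimal.
        arm = int(np.argmax(theta))
        reward = pull_arm(arm)
        total_reward += reward
        # Conjugate posterior update for the chosen arm only.
        alpha[arm] += reward
        beta[arm] += 1 - reward
    return total_reward
```

Because the Beta prior is conjugate to the Bernoulli likelihood, the posterior update reduces to incrementing a success or failure count, which is what makes this special case a convenient testbed.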