On the Performance of Thompson Sampling on Logistic Bandits

We study the logistic bandit, in which rewards are binary with success probability $\exp(\beta a^\top \theta) / (1 + \exp(\beta a^\top \theta))$, and both the actions $a$ and the coefficient vector $\theta$ lie in the $d$-dimensional unit ball. While prior regret bounds for algorithms addressing the logistic bandit exhibit exponential dependence on the slope parameter $\beta$, we establish a regret bound for Thompson sampling that is independent of $\beta$. Specifically, we establish that, when the set of feasible actions is identical to the set of possible coefficient vectors, the Bayesian regret of Thompson sampling is $\tilde{O}(d\sqrt{T})$. We also establish an $\tilde{O}(\sqrt{d\eta T}/\lambda)$ bound that applies more broadly, where $\lambda$ is the worst-case optimal log-odds and $\eta$ is the "fragility dimension," a new statistic we define to capture the degree to which an optimal action for one model fails to satisfice for others. We demonstrate that the fragility dimension plays an essential role by showing that, for any $\epsilon > 0$, no algorithm can achieve $\mathrm{poly}(d, 1/\lambda) \cdot T^{1-\epsilon}$ regret.
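To make the setup concrete, here is a minimal Python sketch (ours, not the paper's) of Thompson sampling on a logistic bandit in the special case where the action set coincides with the support of the prior over $\theta$, taken here to be a finite set of unit vectors so the posterior can be tracked exactly; the dimension, horizon, slope $\beta$, and candidate set below are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem size and slope (not values from the paper).
d, T, beta = 3, 2000, 10.0

# Finite set of unit vectors serving both as the action set and as the
# support of the prior over theta (the paper's "actions = coefficients" case).
candidates = rng.normal(size=(100, d))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)

theta_true = candidates[rng.integers(len(candidates))]  # drawn from the prior

def success_prob(a, theta):
    # Logistic model: exp(beta a.theta) / (1 + exp(beta a.theta)).
    return 1.0 / (1.0 + np.exp(-beta * a @ theta))

log_w = np.zeros(len(candidates))  # log posterior weights over the support
best = success_prob(candidates, theta_true).max()
regret = 0.0

for t in range(T):
    # Thompson sampling: draw theta_hat from the current posterior ...
    w = np.exp(log_w - log_w.max())
    theta_hat = candidates[rng.choice(len(candidates), p=w / w.sum())]
    # ... and play the action that is greedy for the sampled model.
    a = candidates[np.argmax(candidates @ theta_hat)]
    # Observe a binary reward and apply Bayes' rule to the weights.
    r = rng.random() < success_prob(a, theta_true)
    p = success_prob(candidates, a)  # each candidate's predicted success prob
    log_w += np.log(p) if r else np.log(1.0 - p)
    regret += best - success_prob(a, theta_true)

print(f"cumulative regret over T={T} rounds: {regret:.1f}")
```

Because the prior is supported on a finite set, the Bayesian update is exact and no approximate posterior inference (e.g., Laplace approximation or MCMC) is needed; the $\beta$-independence claimed in the paper concerns the regret analysis, not anything special about this simulation.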
