Statistical Efficiency of Thompson Sampling for Combinatorial Semi-Bandits

We investigate the stochastic combinatorial multi-armed bandit problem with semi-bandit feedback (CMAB). In CMAB, the existence of a computationally efficient policy with asymptotically optimal regret (up to a factor poly-logarithmic in the action size) remains open for many families of distributions, including mutually independent outcomes and, more generally, the multivariate sub-Gaussian family. We answer this question for these two families by analyzing variants of the Combinatorial Thompson Sampling (CTS) policy. For mutually independent outcomes in $[0,1]$, we give a tight analysis of CTS with Beta priors. We then turn to the more general setting of multivariate sub-Gaussian outcomes and give a tight analysis of CTS with Gaussian priors. This last result provides an alternative to the Efficient Sampling for Combinatorial Bandit (ESCB) policy, which, although statistically optimal, is not computationally efficient.
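The CTS policy with Beta priors can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm or analysis: the `oracle` and `outcomes` functions are hypothetical placeholders for, respectively, the combinatorial optimization step (e.g. a shortest-path or matroid solver) and the semi-bandit feedback of the environment.

```python
import numpy as np

def cts_beta(oracle, n_arms, outcomes, horizon, rng=None):
    """Sketch of Combinatorial Thompson Sampling with Beta priors.

    oracle:   maps a vector of sampled arm means to a feasible action,
              returned as an array of arm indices (illustrative signature).
    outcomes: returns the Bernoulli outcome of each played arm
              (semi-bandit feedback; illustrative signature).
    """
    rng = rng or np.random.default_rng(0)
    a = np.ones(n_arms)  # Beta posterior "success" counts (uniform prior)
    b = np.ones(n_arms)  # Beta posterior "failure" counts
    for _ in range(horizon):
        theta = rng.beta(a, b)   # sample a mean for every arm from its posterior
        action = oracle(theta)   # solve the combinatorial problem on the sample
        x = outcomes(action)     # observe the outcome of each played arm
        a[action] += x           # Bayesian update, only for the arms played
        b[action] += 1.0 - x
    return a / (a + b)           # posterior mean estimates of the arm means
```

For example, with a top-$m$ oracle (`lambda th: np.argsort(th)[-m:]`), the policy concentrates its plays on the $m$ arms with the highest means, so the posterior means of those arms converge to the true values while suboptimal arms are sampled only rarely.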
