论文信息 - Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits - 字舞流文

Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits

A stochastic combinatorial semi-bandit is an online learning problem where at each step a learning agent chooses a subset of ground items subject to constraints, and then observes stochastic weights of these items and receives their sum as a payoff. In this paper, we close the problem of computationally and sample efficient learning in stochastic combinatorial semi-bandits. In particular, we analyze a UCB-like algorithm for solving the problem, which is known to be computationally efficient; and prove $O(K L (1 / \Delta) \log n)$ and $O(\sqrt{K L n \log n})$ upper bounds on its $n$-step regret, where $L$ is the number of ground items, $K$ is the maximum number of chosen items, and $\Delta$ is the gap between the expected returns of the optimal and best suboptimal solutions. The gap-dependent bound is tight up to a constant factor and the gap-free bound is tight up to a polylogarithmic factor.

Zheng Wen | Csaba Szepesvári | Branislav Kveton | Azin Ashkan | Csaba Szepesvari | B. Kveton | Zheng Wen | Azin Ashkan

[1] W. R. Thompson. ON THE LIKELIHOOD THAT ONE UNKNOWN PROBABILITY EXCEEDS ANOTHER IN VIEW OF THE EVIDENCE OF TWO SAMPLES , 1933 .

[2] M. R. Rao,et al. Combinatorial Optimization , 1992, NATO ASI Series.

[3] William J. Cook,et al. Combinatorial optimization , 1997 .

[4] Peter Auer,et al. Using Confidence Bounds for Exploitation-Exploration Trade-offs , 2003, J. Mach. Learn. Res..

[5] Peter Auer,et al. The Nonstochastic Multiarmed Bandit Problem , 2002, SIAM J. Comput..

[6] Peter Auer,et al. Finite-time Analysis of the Multiarmed Bandit Problem , 2002, Machine Learning.

[7] Jean-Yves Audibert,et al. Minimax Policies for Adversarial and Stochastic Bandits. , 2009, COLT 2009.

[8] Nicolò Cesa-Bianchi,et al. Combinatorial Bandits , 2012, COLT.

[9] Csaba Szepesvári,et al. Improved Algorithms for Linear Stochastic Bandits , 2011, NIPS.

[10] Sébastien Bubeck,et al. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems , 2012, Found. Trends Mach. Learn..

[11] Bhaskar Krishnamachari,et al. Combinatorial Network Optimization With Unknown Variables: Multi-Armed Bandits With Linear Rewards and Individual Observations , 2010, IEEE/ACM Transactions on Networking.

[12] Shipra Agrawal,et al. Analysis of Thompson Sampling for the Multi-armed Bandit Problem , 2011, COLT.

[13] Gergely Neu,et al. An Efficient Algorithm for Learning with Semi-bandit Feedback , 2013, ALT.

[14] Wei Chen,et al. Combinatorial multi-armed bandit: general framework, results and applications , 2013, ICML 2013.

[15] Gábor Lugosi,et al. Regret in Online Combinatorial Optimization , 2012, Math. Oper. Res..

[16] Zheng Wen,et al. Learning to Act Greedily: Polymatroid Semi-Bandits , 2014, ArXiv.

[17] Zheng Wen,et al. Matroid Bandits: Fast Combinatorial Optimization with Learning , 2014, UAI.

[18] Zheng Wen,et al. Matroid Bandits: Practical Large-Scale Combinatorial Bandits , 2014, AAAI 2014.

[19] Branislav Kveton,et al. Efficient Learning in Large-Scale Combinatorial Semi-Bandits , 2014, ICML.

[20] Benjamin Van Roy,et al. An Information-Theoretic Analysis of Thompson Sampling , 2014, J. Mach. Learn. Res..

[21] T. L. Lai Andherbertrobbins. Asymptotically Efficient Adaptive Allocation Rules , 2022 .