Regret Bounds for Stochastic Combinatorial Multi-Armed Bandits with Linear Space Complexity

Many real-world problems face the dilemma of choosing the best $K$ out of $N$ options at a given time instant. This setup can be modelled as a combinatorial bandit that chooses $K$ out of $N$ arms at each time, with the aim of achieving an efficient tradeoff between exploration and exploitation. This is the first work on combinatorial bandits in which the received reward can be a non-linear function of the chosen $K$ arms. The direct use of a multi-armed bandit requires choosing among $\binom{N}{K}$ options, making the action space prohibitively large. In this paper, we present a novel algorithm that is computationally efficient and whose storage is linear in $N$. The proposed algorithm, which we call CMAB-SM, is based on a divide-and-conquer strategy. Further, the proposed algorithm achieves a regret bound of $\tilde{O}(K^{\frac{1}{2}} N^{\frac{1}{3}} T^{\frac{2}{3}})$ over a time horizon $T$, which is sub-linear in all parameters $T$, $N$, and $K$. Evaluation results on different reward functions and arm distributions show significantly improved performance compared to the standard multi-armed bandit approach with $\binom{N}{K}$ choices.
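To make the baseline concrete, the following is a minimal Python sketch of the naive approach the abstract argues against: running UCB1 over all $\binom{N}{K}$ "super-arms". This is not the paper's CMAB-SM algorithm, and the non-linear reward used here (a noisy maximum over the chosen arms, via the hypothetical `reward_fn`) is an illustrative assumption. The point is that `counts` and `means` each hold $\binom{N}{K}$ entries, which is exactly the storage blow-up a scheme linear in $N$ avoids.

```python
# Naive baseline (NOT CMAB-SM): treat every K-subset of the N arms as one
# "super-arm" and run UCB1 over all C(N, K) of them.
import math
import random
from itertools import combinations

def ucb1_super_arms(N, K, reward_fn, T):
    """Run UCB1 over all C(N, K) super-arms for T rounds; return total reward."""
    super_arms = list(combinations(range(N), K))  # C(N, K) subsets
    counts = [0] * len(super_arms)                # pulls per super-arm
    means = [0.0] * len(super_arms)               # empirical mean rewards
    total_reward = 0.0
    for t in range(1, T + 1):
        if t <= len(super_arms):
            i = t - 1                             # pull each super-arm once
        else:
            i = max(range(len(super_arms)),       # UCB1 index
                    key=lambda j: means[j] + math.sqrt(2 * math.log(t) / counts[j]))
        r = float(reward_fn(super_arms[i]))       # possibly non-linear in the K arms
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]    # incremental mean update
        total_reward += r
    return total_reward

# Hypothetical non-linear reward: max of per-arm Bernoulli draws.
probs = [random.random() for _ in range(8)]
reward_fn = lambda arms: max(random.random() < probs[a] for a in arms)
print(ucb1_super_arms(N=8, K=3, reward_fn=reward_fn, T=2000))
```

Even at the modest scale $N = 20$, $K = 5$, this table already has $\binom{20}{5} = 15{,}504$ entries, whereas a storage budget linear in $N$ keeps only 20, which is the gap CMAB-SM's divide-and-conquer strategy is designed to close.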
