Dueling Bandits: From Two-dueling to Multi-dueling

We study a general multi-dueling bandit problem, where an agent compares multiple options simultaneously and aims to minimize the regret incurred by selecting suboptimal arms. This setting generalizes the traditional two-dueling bandit problem and finds many real-world applications involving subjective feedback on multiple options. We start with the two-dueling bandit setting and propose two efficient algorithms, DoublerBAI and MultiSBM-Feedback. DoublerBAI provides a generic schema for translating known results on best arm identification algorithms to the dueling bandit problem, and achieves a regret bound of O(ln T). MultiSBM-Feedback not only has an optimal O(ln T) regret, but also reduces the constant factor by almost half compared to benchmark results. We then consider the general multi-dueling case and develop an efficient algorithm, MultiRUCB. Using a novel finite-time regret analysis for the general multi-dueling bandit problem, we show that MultiRUCB also achieves an O(ln T) regret bound, and that the bound tightens as the capacity of the comparison set increases. Based on both synthetic and real-world datasets, we empirically demonstrate that our algorithms outperform existing algorithms.
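
To make the problem setting concrete, below is a minimal simulation sketch of a multi-dueling bandit round. The preference matrix P (where P[i, j] is the probability that arm i beats arm j), the relative-UCB-style selection heuristic, and the Condorcet-winner regret definition are all illustrative assumptions; this is not the paper's MultiRUCB algorithm.

```python
import numpy as np

# Illustrative multi-dueling bandit simulation (assumptions, not the paper's
# MultiRUCB): each round, a set of m arms is chosen, all pairs in the set
# are dueled, and regret is measured against the Condorcet winner.

rng = np.random.default_rng(0)

K = 5          # number of arms
m = 3          # capacity of the comparison set chosen each round
T = 10_000     # horizon

# Assumed preference matrix: P[i, j] = probability that arm i beats arm j.
# Arm 0 is the Condorcet winner in this synthetic example.
P = np.array([
    [0.5, 0.6, 0.7, 0.8, 0.9],
    [0.4, 0.5, 0.6, 0.7, 0.8],
    [0.3, 0.4, 0.5, 0.6, 0.7],
    [0.2, 0.3, 0.4, 0.5, 0.6],
    [0.1, 0.2, 0.3, 0.4, 0.5],
])

wins = np.zeros((K, K))    # wins[i, j]: times arm i has beaten arm j
plays = np.ones((K, K))    # pairwise comparison counts (1 avoids div by zero)
regret = 0.0

for t in range(1, T + 1):
    # Optimistic estimate of each pairwise win probability (relative UCB).
    ucb = wins / plays + np.sqrt(2.0 * np.log(t) / plays)
    np.fill_diagonal(ucb, 0.5)
    # Score each arm by its least favorable optimistic comparison and
    # pick the m highest-scoring arms as this round's comparison set.
    scores = ucb.min(axis=1)
    chosen = np.argsort(scores)[-m:]
    # Duel every pair in the chosen set and record the outcomes.
    for a in chosen:
        for b in chosen:
            if a < b:
                a_beats_b = rng.random() < P[a, b]
                wins[a, b] += a_beats_b
                wins[b, a] += 1 - a_beats_b
                plays[a, b] += 1
                plays[b, a] += 1
    # Regret w.r.t. the Condorcet winner (arm 0), averaged over the set.
    regret += np.mean(P[0, chosen] - 0.5)

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```

With a logarithmic-regret strategy, the cumulative regret printed at the end should grow roughly like ln T rather than linearly in T; replacing the selection rule with uniform random sets makes the linear-vs-logarithmic contrast easy to see.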
