Dueling Bandits: From Two-dueling to Multi-dueling

We study a general multi-dueling bandit problem, where an agent compares multiple options simultaneously and aims to minimize the regret incurred by selecting suboptimal arms. This setting generalizes the traditional two-dueling bandit problem and finds many real-world applications involving subjective feedback on multiple options. We start with the two-dueling bandit setting and propose two efficient algorithms, DoublerBAI and MultiSBM-Feedback. DoublerBAI provides a generic schema for translating known results on best arm identification algorithms to the dueling bandit problem, and achieves a regret bound of O(ln T). MultiSBM-Feedback not only has an optimal O(ln T) regret, but also reduces the constant factor by almost half compared to benchmark results. We then consider the general multi-dueling case and develop an efficient algorithm, MultiRUCB. Using a novel finite-time regret analysis for the general multi-dueling bandit problem, we show that MultiRUCB also achieves an O(ln T) regret bound, and that the bound tightens as the capacity of the comparison set increases. Based on both synthetic and real-world datasets, we empirically demonstrate that our algorithms outperform existing algorithms.
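
To make the problem setting concrete, below is a minimal simulation sketch of a multi-dueling bandit round. The preference matrix P (where P[i, j] is the probability that arm i beats arm j), the relative-UCB-style selection heuristic, and the Condorcet-winner regret definition are all illustrative assumptions; this is not the paper's MultiRUCB algorithm.

```python
import numpy as np

# Illustrative multi-dueling bandit simulation (assumptions, not the paper's
# MultiRUCB): each round, a set of m arms is chosen, all pairs in the set
# are dueled, and regret is measured against the Condorcet winner.

rng = np.random.default_rng(0)

K = 5          # number of arms
m = 3          # capacity of the comparison set chosen each round
T = 10_000     # horizon

# Assumed preference matrix: P[i, j] = probability that arm i beats arm j.
# Arm 0 is the Condorcet winner in this synthetic example.
P = np.array([
    [0.5, 0.6, 0.7, 0.8, 0.9],
    [0.4, 0.5, 0.6, 0.7, 0.8],
    [0.3, 0.4, 0.5, 0.6, 0.7],
    [0.2, 0.3, 0.4, 0.5, 0.6],
    [0.1, 0.2, 0.3, 0.4, 0.5],
])

wins = np.zeros((K, K))    # wins[i, j]: times arm i has beaten arm j
plays = np.ones((K, K))    # pairwise comparison counts (1 avoids div by zero)
regret = 0.0

for t in range(1, T + 1):
    # Optimistic estimate of each pairwise win probability (relative UCB).
    ucb = wins / plays + np.sqrt(2.0 * np.log(t) / plays)
    np.fill_diagonal(ucb, 0.5)
    # Score each arm by its least favorable optimistic comparison and
    # pick the m highest-scoring arms as this round's comparison set.
    scores = ucb.min(axis=1)
    chosen = np.argsort(scores)[-m:]
    # Duel every pair in the chosen set and record the outcomes.
    for a in chosen:
        for b in chosen:
            if a < b:
                a_beats_b = rng.random() < P[a, b]
                wins[a, b] += a_beats_b
                wins[b, a] += 1 - a_beats_b
                plays[a, b] += 1
                plays[b, a] += 1
    # Regret w.r.t. the Condorcet winner (arm 0), averaged over the set.
    regret += np.mean(P[0, chosen] - 0.5)

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```

With a logarithmic-regret strategy, the cumulative regret printed at the end should grow roughly like ln T rather than linearly in T; replacing the selection rule with uniform random sets makes the linear-vs-logarithmic contrast easy to see.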
