A Relative Exponential Weighing Algorithm for Adversarial Utility-based Dueling Bandits

We study the K-armed dueling bandit problem, a variation of the classical Multi-Armed Bandit (MAB) problem in which the learner receives only relative feedback about the selected pairs of arms. We propose an efficient algorithm called Relative Exponential-weight algorithm for Exploration and Exploitation (REX3) to handle the adversarial utility-based formulation of this problem. We prove a finite-time expected regret upper bound of order O(√(K ln(K) T)) for this algorithm and a general lower bound of order Ω(√(KT)). Finally, we provide experimental results using real data from information retrieval applications.
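For readers who want a concrete picture of the approach, the following is a minimal Python sketch of an EXP3-style exponential-weight update driven by relative feedback, which is the core idea behind REX3. The `duel` callback, the fixed exploration rate `gamma`, and the exact form of the importance-weighted estimator are illustrative assumptions here, not the paper's precise algorithm; consult the paper's pseudocode for the authoritative version.

```python
import math
import random

def rex3_sketch(K, T, duel, gamma=0.1):
    """EXP3-style sketch for utility-based dueling bandits.

    duel(a, b) is assumed to return relative feedback in [-1/2, 1/2]:
    positive if arm a beats arm b, negative otherwise.
    Returns the final normalized weight vector over the K arms.
    """
    weights = [1.0] * K
    for _ in range(T):
        total = sum(weights)
        # Mix the exponential weights with uniform exploration.
        probs = [(1 - gamma) * w / total + gamma / K for w in weights]
        # Draw both arms of the duel independently from the same distribution.
        a = random.choices(range(K), weights=probs)[0]
        b = random.choices(range(K), weights=probs)[0]
        psi = duel(a, b)  # relative feedback for the pair (a, b)
        # Importance-weighted update: credit arm a, debit arm b.
        # The constants below are assumptions for illustration.
        weights[a] *= math.exp(gamma / K * psi / (2 * probs[a]))
        weights[b] *= math.exp(-gamma / K * psi / (2 * probs[b]))
    total = sum(weights)
    return [w / total for w in weights]
```

Because only the sign and magnitude of the pairwise comparison are observed, the update spreads the importance-weighted gain estimate across both arms of the duel rather than crediting a single pulled arm as in standard EXP3.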
