A Relative Exponential Weighing Algorithm for Adversarial Utility-based Dueling Bandits

We study the K-armed dueling bandit problem, a variation of the classical Multi-Armed Bandit (MAB) problem in which the learner receives only relative feedback about the selected pairs of arms. We propose an efficient algorithm called Relative Exponential-weight algorithm for Exploration and Exploitation (REX3) to handle the adversarial utility-based formulation of this problem. We prove a finite-time expected regret upper bound of order O(√(K ln(K) T)) for this algorithm and a general lower bound of order Ω(√(KT)). Finally, we provide experimental results using real data from information retrieval applications.
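For readers who want a concrete picture of the approach, the following is a minimal Python sketch of an EXP3-style exponential-weight update driven by relative feedback, which is the core idea behind REX3. The `duel` callback, the fixed exploration rate `gamma`, and the exact form of the importance-weighted estimator are illustrative assumptions here, not the paper's precise algorithm; consult the paper's pseudocode for the authoritative version.

```python
import math
import random

def rex3_sketch(K, T, duel, gamma=0.1):
    """EXP3-style sketch for utility-based dueling bandits.

    duel(a, b) is assumed to return relative feedback in [-1/2, 1/2]:
    positive if arm a beats arm b, negative otherwise.
    Returns the final normalized weight vector over the K arms.
    """
    weights = [1.0] * K
    for _ in range(T):
        total = sum(weights)
        # Mix the exponential weights with uniform exploration.
        probs = [(1 - gamma) * w / total + gamma / K for w in weights]
        # Draw both arms of the duel independently from the same distribution.
        a = random.choices(range(K), weights=probs)[0]
        b = random.choices(range(K), weights=probs)[0]
        psi = duel(a, b)  # relative feedback for the pair (a, b)
        # Importance-weighted update: credit arm a, debit arm b.
        # The constants below are assumptions for illustration.
        weights[a] *= math.exp(gamma / K * psi / (2 * probs[a]))
        weights[b] *= math.exp(-gamma / K * psi / (2 * probs[b]))
    total = sum(weights)
    return [w / total for w in weights]
```

Because only the sign and magnitude of the pairwise comparison are observed, the update spreads the importance-weighted gain estimate across both arms of the duel rather than crediting a single pulled arm as in standard EXP3.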
