Generic Exploration and K-armed Voting Bandits

We study a stochastic online learning scheme with partial feedback where the utility of decisions is only observable through an estimation of the environment parameters. We propose a generic pure-exploration algorithm, able to cope with various utility functions from multi-armed bandits settings to dueling bandits. The primary application of this setting is to offer a natural generalization of dueling bandits for situations where the environment parameters reflect the idiosyncratic preferences of a mixed crowd.

[1]  Jean-Paul Chilès,et al.  Wiley Series in Probability and Statistics , 2012 .

[2]  Shie Mannor,et al.  Bandits with an Edge , 2011, ArXiv.

[3]  W. Hoeffding Probability Inequalities for sums of Bounded Random Variables , 1963 .

[4]  Thorsten Joachims,et al.  Interactively optimizing information retrieval systems as a dueling bandits problem , 2009, ICML '09.

[5]  Peter Auer,et al.  UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem , 2010, Period. Math. Hung..

[6]  Filip Radlinski,et al.  Large-scale validation and analysis of interleaved search evaluation , 2012, TOIS.

[7]  Peter Auer,et al.  Finite-time Analysis of the Multiarmed Bandit Problem , 2002, Machine Learning.

[8]  T. L. Lai Andherbertrobbins Asymptotically Efficient Adaptive Allocation Rules , 1985 .

[9]  Eli Upfal,et al.  Computing with Noisy Information , 1994, SIAM J. Comput..

[10]  Thorsten Joachims,et al.  Beat the Mean Bandit , 2011, ICML.

[11]  Shie Mannor,et al.  Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems , 2006, J. Mach. Learn. Res..

[12]  Thorsten Joachims,et al.  The K-armed Dueling Bandits Problem , 2012, COLT.

[13]  Thorsten Joachims,et al.  Evaluating Retrieval Performance Using Clickthrough Data , 2003, Text Mining.

[14]  Shie Mannor,et al.  PAC Bounds for Multi-armed Bandit and Markov Decision Processes , 2002, COLT.

[15]  Bala Ravikumar,et al.  On Selecting the Largest Element in Spite of Erroneous Information , 1987, STACS.

[16]  Rémi Munos,et al.  Pure exploration in finitely-armed and continuous-armed bandits , 2011, Theor. Comput. Sci..

[17]  Csaba Szepesvári,et al.  Exploration-exploitation tradeoff using variance estimates in multi-armed bandits , 2009, Theor. Comput. Sci..

[18]  Yann Chevaleyre,et al.  A Short Introduction to Computational Social Choice , 2007, SOFSEM.

[19]  Irène Charon,et al.  An updated survey on the linear ordering problem for weighted or unweighted tournaments , 2010, Ann. Oper. Res..