Paper Information - Generic Exploration and K-armed Voting Bandits

We study a stochastic online learning scheme with partial feedback in which the utility of a decision is observable only through an estimate of the environment parameters. We propose a generic pure-exploration algorithm that can cope with a range of utility functions, from multi-armed bandit settings to dueling bandits. The primary application of this setting is to offer a natural generalization of dueling bandits to situations where the environment parameters reflect the idiosyncratic preferences of a mixed crowd.
