Combinatorial Bandits with Relative Feedback

We consider combinatorial online learning with subset choices when only relative feedback from the played subset is available, rather than absolute bandit or semi-bandit feedback. Specifically, we study two regret-minimisation problems over subsets of a finite ground set $[n]$, with subset-wise relative preference feedback generated according to the multinomial logit (MNL) choice model. In the first setting, the learner plays subsets of size at most a given maximum and receives top-$m$ rank-ordered feedback; in the second, the learner plays subsets of a fixed size $k$ and observes a full ranking of the played subset as feedback. For both settings, we devise instance-dependent, order-optimal regret algorithms with regret $O(\frac{n}{m} \ln T)$ and $O(\frac{n}{k} \ln T)$, respectively. We also derive fundamental limits on the regret of online learning with subset-wise preferences, establishing the tightness of our guarantees, and our results show the value of eliciting more general top-$m$ rank-ordered feedback over single-winner feedback ($m=1$). Our theoretical results are corroborated by empirical evaluations.
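To make the feedback model concrete, the sketch below (a minimal illustration in Python; the function and parameter names are hypothetical and not from the paper) simulates top-$m$ rank-ordered feedback from a played subset under the MNL / Plackett-Luce model: the learner observes the first $m$ entries of a ranking drawn by repeatedly sampling an item, among those not yet ranked, with probability proportional to its utility score. Setting $m=1$ recovers single-winner feedback, and $m=k$ on a size-$k$ subset gives the full-ranking feedback of the second setting.

```python
import numpy as np

def sample_top_m_mnl_feedback(theta, subset, m, rng=None):
    """Simulate top-m rank-ordered feedback for a played subset under the
    multinomial logit (MNL / Plackett-Luce) choice model.

    theta  : array of positive MNL utility scores for all n items
    subset : indices of the items played in this round
    m      : number of top-ranked positions revealed as feedback
    """
    rng = np.random.default_rng() if rng is None else rng
    remaining = list(subset)
    ranking = []
    for _ in range(min(m, len(remaining))):
        scores = np.array([theta[i] for i in remaining])
        probs = scores / scores.sum()           # MNL choice probabilities over unranked items
        pick = rng.choice(len(remaining), p=probs)
        ranking.append(remaining.pop(pick))     # next-ranked item, sampled without replacement
    return ranking

# Example (hypothetical instance): n = 6 items, play a subset of size 4, observe top-2 feedback.
theta = np.array([0.9, 0.5, 0.8, 0.3, 0.6, 0.4])
print(sample_top_m_mnl_feedback(theta, subset=[0, 2, 4, 5], m=2))
```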
