Combinatorial Bandits with Relative Feedback

We consider combinatorial online learning with subset choices when only relative feedback from the played subset is available, rather than absolute bandit or semi-bandit feedback. Specifically, we study two regret-minimisation problems over subsets of a finite ground set $[n]$, with subset-wise relative preference feedback generated according to the multinomial logit (MNL) choice model. In the first setting, the learner plays subsets of size at most a given maximum and receives top-$m$ rank-ordered feedback; in the second, the learner plays subsets of a fixed size $k$ and observes a full ranking of the played subset as feedback. For both settings, we devise instance-dependent, order-optimal regret algorithms with regret $O(\frac{n}{m} \ln T)$ and $O(\frac{n}{k} \ln T)$, respectively. We also derive fundamental limits on the regret of online learning with subset-wise preferences, establishing the tightness of our guarantees, and our results show the value of eliciting more general top-$m$ rank-ordered feedback over single-winner feedback ($m=1$). Our theoretical results are corroborated by empirical evaluations.
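To make the feedback model concrete, the sketch below (a minimal illustration in Python; the function and parameter names are hypothetical and not from the paper) simulates top-$m$ rank-ordered feedback from a played subset under the MNL / Plackett-Luce model: the learner observes the first $m$ entries of a ranking drawn by repeatedly sampling an item, among those not yet ranked, with probability proportional to its utility score. Setting $m=1$ recovers single-winner feedback, and $m=k$ on a size-$k$ subset gives the full-ranking feedback of the second setting.

```python
import numpy as np

def sample_top_m_mnl_feedback(theta, subset, m, rng=None):
    """Simulate top-m rank-ordered feedback for a played subset under the
    multinomial logit (MNL / Plackett-Luce) choice model.

    theta  : array of positive MNL utility scores for all n items
    subset : indices of the items played in this round
    m      : number of top-ranked positions revealed as feedback
    """
    rng = np.random.default_rng() if rng is None else rng
    remaining = list(subset)
    ranking = []
    for _ in range(min(m, len(remaining))):
        scores = np.array([theta[i] for i in remaining])
        probs = scores / scores.sum()           # MNL choice probabilities over unranked items
        pick = rng.choice(len(remaining), p=probs)
        ranking.append(remaining.pop(pick))     # next-ranked item, sampled without replacement
    return ranking

# Example (hypothetical instance): n = 6 items, play a subset of size 4, observe top-2 feedback.
theta = np.array([0.9, 0.5, 0.8, 0.3, 0.6, 0.4])
print(sample_top_m_mnl_feedback(theta, subset=[0, 2, 4, 5], m=2))
```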
