Algorithms for slate bandits with non-separable reward functions

In this paper, we study a slate bandit problem in which the function that determines the slate-level reward is non-separable: the optimal value of the function cannot be determined by learning the optimal action for each slot separately. We are mainly concerned with cases where the number of slates is large relative to the time horizon, so that treating each slate as a separate arm in a traditional multi-armed bandit is not feasible. Our main contribution is the design of algorithms that still achieve regret that is sublinear in the time horizon, despite the large number of slates. Experimental results on simulated and real-world data show that our proposed method outperforms popular benchmark bandit algorithms.
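To make the notion of non-separability concrete, the following is a minimal sketch in Python, not the algorithm proposed in the paper. The reward table `EXPECTED_REWARD` and all function names are hypothetical, and "learning the optimal action for each slot" is modeled here, as an assumption, by averaging the reward of each action over uniformly chosen actions in the other slot. Under these assumptions the per-slot choice misses the best slate, and the slate count grows exponentially in the number of slots.

```python
from itertools import product

# Hypothetical expected slate-level rewards for 2 slots with 2 actions each.
# EXPECTED_REWARD[a][b] is the mean reward of the slate (a, b).
EXPECTED_REWARD = [
    [0.6, 0.6],
    [0.0, 1.0],
]


def slate_count(num_slots, actions_per_slot):
    """Number of distinct slates when every slot has the same number of actions."""
    return actions_per_slot ** num_slots


def best_slate_exhaustive(reward):
    """Return the slate with the highest expected reward by checking every slate."""
    slates = product(range(len(reward)), range(len(reward[0])))
    return max(slates, key=lambda s: reward[s[0]][s[1]])


def best_slate_per_slot(reward):
    """Pick, for each slot, the action with the best average reward.

    The average is taken over a uniformly chosen action in the other slot
    (an illustrative assumption, not the paper's definition).
    """
    n_a, n_b = len(reward), len(reward[0])
    slot1_means = [sum(reward[a]) / n_b for a in range(n_a)]
    slot2_means = [sum(reward[a][b] for a in range(n_a)) / n_a for b in range(n_b)]
    best_a = max(range(n_a), key=lambda a: slot1_means[a])
    best_b = max(range(n_b), key=lambda b: slot2_means[b])
    return (best_a, best_b)


if __name__ == "__main__":
    greedy = best_slate_per_slot(EXPECTED_REWARD)
    optimal = best_slate_exhaustive(EXPECTED_REWARD)
    print("per-slot choice:", greedy, EXPECTED_REWARD[greedy[0]][greedy[1]])
    print("optimal slate:  ", optimal, EXPECTED_REWARD[optimal[0]][optimal[1]])
    # With 5 slots and 10 actions per slot there are already 10**5 slates.
    print("slates for 5 slots x 10 actions:", slate_count(5, 10))
```

In this toy table the per-slot rule selects a slate with expected reward 0.6, while exhaustive search finds one with expected reward 1.0. Exhaustive search is only possible here because the table is tiny; with many slots it corresponds to the one-arm-per-slate approach that the abstract describes as infeasible.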
