Incentivized Exploration for Multi-Armed Bandits under Reward Drift

We study incentivized exploration for the multi-armed bandit (MAB) problem, in which players receive compensation for exploring arms other than the greedy choice and may, in turn, provide biased reward feedback. We quantify the impact of this drifted reward feedback by analyzing three instantiations of the incentivized MAB algorithm: UCB, $\varepsilon$-Greedy, and Thompson Sampling. Our results show that all three achieve $\mathcal{O}(\log T)$ regret and compensation under drifted rewards, and are therefore effective in incentivizing exploration. Numerical examples complement the theoretical analysis.
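To make the interaction concrete, below is a minimal sketch of one round-by-round loop in the incentivized-UCB style described above: the principal recommends the UCB arm, pays the empirical-mean gap as compensation whenever that arm differs from the player's greedy choice, and then observes feedback that may drift by an amount bounded in proportion to the compensation paid. The Bernoulli reward model, the `drift_scale` parameter, and the uniform drift draw are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def incentivized_ucb(true_means, T, drift_scale=0.1, seed=0):
    """Sketch of incentivized UCB under reward drift.

    Assumptions (illustrative, not the paper's exact model):
    - true_means are Bernoulli parameters in [0, 1];
    - compensation equals the empirical-mean gap between the player's
      greedy arm and the recommended arm;
    - the reported reward drifts by at most drift_scale * compensation.
    """
    rng = np.random.default_rng(seed)
    K = len(true_means)
    counts = np.zeros(K)
    means = np.zeros(K)          # empirical means of (possibly drifted) feedback
    regret = compensation = 0.0

    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1          # pull each arm once to initialize
        else:
            ucb = means + np.sqrt(2.0 * np.log(t) / counts)
            arm = int(np.argmax(ucb))

        greedy = int(np.argmax(means))
        pay = max(means[greedy] - means[arm], 0.0)   # gap paid to incentivize
        compensation += pay

        reward = rng.binomial(1, true_means[arm])        # true Bernoulli reward
        drift = drift_scale * pay * rng.uniform(-1, 1)   # bounded reward drift
        feedback = reward + drift                        # biased feedback observed

        counts[arm] += 1
        means[arm] += (feedback - means[arm]) / counts[arm]
        regret += max(true_means) - true_means[arm]

    return regret, compensation
```

For example, `incentivized_ucb([0.9, 0.8, 0.5], T=10_000)` returns the cumulative regret and compensation for a three-armed instance; under this drift model both should grow roughly logarithmically in $T$, consistent with the stated $\mathcal{O}(\log T)$ bounds.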
