Incentivizing Exploration in Linear Bandits under Information Gap

We study the problem of incentivizing exploration for myopic users in linear bandits, where the users tend to exploit the arm with the highest predicted reward instead of exploring. In order to maximize the long-term reward, the system offers compensation to incentivize the users to pull exploratory arms, with the goal of balancing the trade-off among exploitation, exploration, and compensation. We consider a new and practically motivated setting where the context features observed by the user are more informative than those used by the system; for example, features based on a user's private information are not accessible to the system. We propose a new method to incentivize exploration under such an information gap, and prove that the method achieves both sublinear regret and sublinear compensation. We theoretically and empirically analyze the additional compensation incurred due to the information gap, compared with the case where the system has access to the same context features as the user, i.e., without the information gap. We also provide a compensation lower bound for our problem.
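To make the per-round interaction concrete, the following is a minimal simulation sketch of one possible instantiation of this setting, not the paper's algorithm: the user predicts rewards from the full feature vector, the system runs a LinUCB-style optimistic rule on a strict subset of those features, and the compensation paid each round is the gap, under the user's own predictions, between the user's greedy arm and the arm the system wants pulled. All names and parameters (d_user, d_sys, beta, the idealization that the user's predictions equal the true mean rewards, etc.) are illustrative assumptions.

# Sketch: incentivized exploration in a linear bandit under an information gap.
# The user sees d_user features per arm; the system only sees the first d_sys.
import numpy as np

rng = np.random.default_rng(0)

T = 2000          # number of rounds
K = 10            # number of arms per round
d_user = 6        # dimension of the user's (full) feature vector
d_sys = 4         # dimension visible to the system (d_sys < d_user: the information gap)
lam = 1.0         # ridge regularization for the system's estimator
beta = 1.0        # width of the system's confidence bonus (assumed constant here)

theta_star = rng.normal(size=d_user)
theta_star /= np.linalg.norm(theta_star)   # true parameter over the user's features

# System's ridge-regression statistics over its own (partial) features.
V = lam * np.eye(d_sys)
b = np.zeros(d_sys)

total_regret = 0.0
total_compensation = 0.0

for t in range(T):
    # Arm features observed by the user; the system only sees the first d_sys coordinates.
    X_user = rng.normal(size=(K, d_user)) / np.sqrt(d_user)
    X_sys = X_user[:, :d_sys]

    # Myopic user: would pull the arm with the highest reward predicted from the
    # full features (idealized here as the true mean reward).
    user_pred = X_user @ theta_star
    greedy_arm = int(np.argmax(user_pred))

    # System: optimistic (LinUCB-style) index computed from its partial features.
    theta_hat = np.linalg.solve(V, b)
    V_inv = np.linalg.inv(V)
    bonus = beta * np.sqrt(np.einsum("ki,ij,kj->k", X_sys, V_inv, X_sys))
    target_arm = int(np.argmax(X_sys @ theta_hat + bonus))

    # Compensation: just enough to make the myopic user switch from the greedy
    # arm to the system's target arm (zero when the two coincide).
    compensation = user_pred[greedy_arm] - user_pred[target_arm]
    total_compensation += compensation

    # The incentivized user pulls the target arm; reward is linear plus noise.
    reward = user_pred[target_arm] + 0.1 * rng.normal()
    total_regret += user_pred.max() - user_pred[target_arm]

    # Update the system's estimator with the features it can observe.
    V += np.outer(X_sys[target_arm], X_sys[target_arm])
    b += reward * X_sys[target_arm]

print(f"cumulative regret = {total_regret:.1f}, cumulative compensation = {total_compensation:.1f}")

In this sketch the compensation shrinks as the system's optimistic index concentrates, which is the informal reason one can hope for sublinear compensation; the information gap shows up as the d_user - d_sys coordinates the system can never use when forming its estimate.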
