Sample Complexity of Incentivized Exploration

We consider incentivized exploration: a version of multi-armed bandits in which the choice of actions is controlled by self-interested agents, and the algorithm can only issue recommendations. The algorithm controls the flow of information, and this information asymmetry can incentivize the agents to explore. Prior work matches the optimal regret rates for bandits up to "constant" multiplicative factors determined by the Bayesian prior. However, these prior-dependent factors can be arbitrarily large, and the dependence on the number of arms K can be exponential; the optimal dependence on the prior and on K remains unclear. We make progress on these issues. Our first result is that Thompson sampling is incentive-compatible if initialized with enough data points. Thus, we reduce the problem of designing incentive-compatible algorithms to one of sample complexity: (i) how many data points are needed to incentivize Thompson sampling, and (ii) how many rounds does it take to collect these samples? We address both questions, providing upper bounds on the sample complexity that are typically polynomial in K and lower bounds that match them up to polynomial factors.
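To make the "Thompson sampling initialized with enough data points" idea concrete, here is a minimal sketch for Bernoulli bandits with Beta priors. The initialization parameter n0 (samples per arm), the Beta(1,1) priors, and the function name are illustrative assumptions for this sketch, not choices made in the paper.

```python
import numpy as np

def initialized_thompson_sampling(true_means, n0=10, horizon=1000, rng=None):
    """Minimal Thompson sampling sketch for Bernoulli bandits with Beta(1,1) priors.

    Each arm is first initialized with n0 reward samples (standing in for the
    data points the abstract refers to); afterwards, each round recommends the
    arm whose posterior sample is largest. n0 is an illustrative assumption.
    """
    rng = np.random.default_rng() if rng is None else rng
    K = len(true_means)
    successes = np.zeros(K)
    failures = np.zeros(K)

    # Initialization phase: collect n0 samples from every arm.
    for a in range(K):
        rewards = rng.random(n0) < true_means[a]
        successes[a] += rewards.sum()
        failures[a] += n0 - rewards.sum()

    total_reward = 0.0
    for _ in range(horizon):
        # Sample a mean estimate from each arm's Beta posterior,
        # then recommend the arm with the largest sampled value.
        theta = rng.beta(1 + successes, 1 + failures)
        a = int(np.argmax(theta))
        r = float(rng.random() < true_means[a])
        successes[a] += r
        failures[a] += 1 - r
        total_reward += r
    return total_reward

# Example: K = 3 arms with unknown means.
print(initialized_thompson_sampling([0.3, 0.5, 0.7], n0=10, horizon=1000))
```

The sketch only illustrates the mechanics of posterior sampling after an initialization phase; it does not capture the incentive-compatibility analysis, which concerns how much initial data is needed before agents are willing to follow the recommendations.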
