(Almost) Free Incentivized Exploration from Decentralized Learning Agents

Incentivized exploration in multi-armed bandits (MAB) has attracted increasing interest and seen much progress in recent years: a principal offers bonuses to agents so that they explore on her behalf. However, almost all existing studies are confined to temporary, myopic agents. In this work, we break this barrier and study incentivized exploration with multiple long-term strategic agents, whose more complicated behaviors often appear in real-world applications. An important observation of this work is that strategic agents’ intrinsic need to learn benefits (instead of harming) the principal’s exploration by providing “free pulls”. Moreover, it turns out that increasing the population of agents significantly lowers the principal’s incentivizing burden. The key and somewhat surprising insight revealed by our results is that, when sufficiently many learning agents are involved, the principal’s exploration can be (almost) free. Our main results are built upon three novel components that may be of independent interest: (1) a simple yet provably effective incentive-provision strategy; (2) a carefully crafted best arm identification algorithm for rewards aggregated under unequal confidences; (3) a high-probability finite-time lower bound for UCB algorithms. Experimental results are provided to complement the theoretical analysis.
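As a rough illustration of component (2), aggregating reward observations that carry unequal confidences, the sketch below combines per-agent empirical means of a single arm using count-proportional weights and attaches a Hoeffding-style confidence radius. The function name, the weighting scheme, and the confidence parameter delta are assumptions made for illustration only; this is not the paper's actual aggregation rule or best arm identification algorithm.

```python
import numpy as np

def aggregate_unequal_confidence(means, counts, delta=0.05):
    """Combine per-agent empirical means of one arm into a single estimate.

    Hypothetical sketch: agents that pulled the arm more often are more
    confident, so their estimates receive proportionally larger weight.
    Rewards are assumed to be bounded in [0, 1].
    """
    means = np.asarray(means, dtype=float)
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total == 0:
        return 0.0, float("inf")
    # Count-proportional (precision-like) weighting of the agents' estimates.
    weights = counts / total
    agg_mean = float(np.dot(weights, means))
    # Hoeffding-style confidence radius shrinking with the total sample count.
    radius = float(np.sqrt(np.log(2.0 / delta) / (2.0 * total)))
    return agg_mean, radius

# Example: three agents report estimates of the same arm with unequal sample sizes.
print(aggregate_unequal_confidence([0.62, 0.55, 0.70], [100, 10, 40]))
```

The design point is simply that an estimate backed by more pulls should dominate the aggregate, while the confidence radius reflects the pooled sample size across all agents.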
