Incentivizing exploration

We study a Bayesian multi-armed bandit (MAB) setting in which a principal seeks to maximize the sum of expected time-discounted rewards obtained by pulling arms, when the arms are actually pulled by selfish and myopic individuals. Since such individuals pull the arm with the highest expected posterior reward (i.e., they always exploit and never explore), the principal must incentivize them to explore by offering suitable payments. Among other applications, this setting models crowdsourced information discovery and funding agencies incentivizing scientists to perform high-risk, high-reward research. We explore the tradeoff between the principal's total expected time-discounted incentive payments and the total time-discounted rewards realized. Specifically, with a time-discount factor γ ∈ (0,1), let OPT denote the total expected time-discounted reward achievable by a principal who pulls arms directly in a MAB problem, without having to incentivize selfish agents. We call a pair (ρ, b) ∈ [0,1]² consisting of a reward ρ and payment b achievable if for every MAB instance, using expected time-discounted payments of at most b·OPT, the principal can guarantee an expected time-discounted reward of at least ρ·OPT. Our main result is an essentially complete characterization of achievable (payment, reward) pairs: if √b + √(1−ρ) > √γ, then (ρ, b) is achievable, and if √b + √(1−ρ) < √γ, then (ρ, b) is not achievable. In proving this characterization, we analyze so-called time-expanded policies, which in each step let the agents choose myopically with some probability p and incentivize them to choose "optimally" with probability 1−p. The analysis of time-expanded policies leads to a question that may be of independent interest: if the same MAB instance (without selfish agents) is considered under two different time-discount factors γ > η, how small can the ratio of OPT_η to OPT_γ be? We give a complete answer to this question, showing that OPT_η ≥ (1−γ)²/(1−η)² · OPT_γ, and that this bound is tight.
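To make the two quantitative statements above concrete, the following is a minimal plain-Python sketch (not taken from the paper; the function names and the parameter values such as γ = 0.9 are purely illustrative) that evaluates the achievability condition √b + √(1−ρ) > √γ, the payment budget on the boundary of the characterization, and the discount-factor lower bound OPT_η ≥ (1−γ)²/(1−η)² · OPT_γ.

import math

def min_payment_budget(rho, gamma):
    # Payment fraction b (as a multiple of OPT) on the boundary of the
    # characterization sqrt(b) + sqrt(1 - rho) = sqrt(gamma).
    return max(0.0, math.sqrt(gamma) - math.sqrt(1.0 - rho)) ** 2

def is_achievable(rho, b, gamma):
    # Strictly inside the achievable region: sqrt(b) + sqrt(1 - rho) > sqrt(gamma).
    return math.sqrt(b) + math.sqrt(1.0 - rho) > math.sqrt(gamma)

def discount_ratio_lower_bound(gamma, eta):
    # For discount factors gamma > eta: OPT_eta >= (1 - gamma)^2 / (1 - eta)^2 * OPT_gamma.
    assert 0.0 < eta < gamma < 1.0
    return (1.0 - gamma) ** 2 / (1.0 - eta) ** 2

if __name__ == "__main__":
    gamma = 0.9  # illustrative discount factor, not a value from the paper
    for rho in (0.5, 0.75, 0.9, 0.99):
        b = min_payment_budget(rho, gamma)
        print(f"rho = {rho:.2f}: boundary payment budget b = {b:.4f}, "
              f"achievable just above it: {is_achievable(rho, b + 1e-6, gamma)}")
    # Lower bound on OPT_eta / OPT_gamma when the discount factor drops from 0.9 to 0.8.
    print("OPT_eta / OPT_gamma >=", discount_ratio_lower_bound(0.9, 0.8))

For example, with γ = 0.9, guaranteeing ρ = 0.99 of OPT requires a payment budget of roughly 0.72·OPT on the boundary of the characterization, while ρ = 0.5 requires only about 0.06·OPT.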
