Incentivizing Exploration with Heterogeneous Value of Money

Recently, Frazier et al. proposed a natural model for crowdsourced exploration of a priori unknown options: a principal is interested in the long-term welfare of a population of agents who arrive one by one in a multi-armed bandit setting. Each agent is myopic, however, so to incentivize an agent to explore options with better long-term prospects, the principal must offer the agent money. Frazier et al. showed that a simple class of policies, called time-expanded policies, is optimal in the worst case, and characterized their budget-reward tradeoff. That work assumed that all agents are equally and uniformly susceptible to financial incentives; in reality, agents may value money differently. We therefore extend the model of Frazier et al. to agents with heterogeneous and non-linear utilities for money. The principal is informed of each agent's tradeoff via a signal that may be more or less informative. Our main result shows that a convex program can be used to derive a signal-dependent time-expanded policy which achieves the best possible Lagrangian reward in the worst case. The worst-case guarantee is matched by so-called "Diamonds in the Rough" instances; the proof that the guarantees match is based on showing that two different convex programs have the same optimal solution for these specific instances.
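The incentive mechanism described above can be illustrated with a toy simulation. This is a minimal sketch under illustrative assumptions, not the paper's actual policy: a myopic agent pulls the arm with the highest posterior mean, and to buy exploration of another arm the principal must pay an amount m whose utility u(m) covers the agent's myopic loss in expected reward. Assuming a simple linear money-utility u(m) = θ·m, where θ captures the agent's heterogeneous value of money, the cheapest acceptable payment is gap/θ. The arm means, the θ distribution, and the schedule of when exploration is bought are all hypothetical choices for illustration.

```python
import random

def posterior_mean(successes, failures):
    # Beta(1,1) prior on a Bernoulli arm -> posterior mean
    return (1 + successes) / (2 + successes + failures)

def min_payment(gap, theta):
    # An agent with linear money-utility u(m) = theta * m agrees to
    # explore iff u(m) >= gap in myopic expected reward, so the
    # cheapest incentive payment is gap / theta.
    return gap / theta

# Two arms with unknown true means; stats[arm] = [successes, failures].
random.seed(0)
true_means = [0.4, 0.7]
stats = [[0, 0], [0, 0]]
budget_spent, total_reward = 0.0, 0.0

for t in range(200):
    means = [posterior_mean(*s) for s in stats]
    myopic = max(range(2), key=lambda a: means[a])          # agent's choice
    target = min(range(2), key=lambda a: sum(stats[a]))     # least-explored arm
    arm = myopic
    if target != myopic and t % 4 == 0:  # occasionally buy exploration
        theta = random.uniform(0.5, 2.0)  # heterogeneous value of money
        gap = means[myopic] - means[target]
        budget_spent += min_payment(gap, theta)
        arm = target
    r = 1 if random.random() < true_means[arm] else 0
    stats[arm][0 if r else 1] += 1
    total_reward += r

print(total_reward, budget_spent)
```

Tracking total reward against budget spent across different exploration schedules is exactly the budget-reward tradeoff the policies above are designed to optimize; the Lagrangian form in the paper folds the payment cost into the objective instead of imposing a hard budget.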

[1] Sudipto Guha, et al. Approximation Algorithms for Bayesian Multi-Armed Bandit Problems, 2013, arXiv.

[2] Christian M. Ernst, et al. Multi-armed Bandit Allocation Indices, 1989.

[3] D. Luenberger. Optimization by Vector Space Methods, 1968.

[4] Aleksandrs Slivkins, et al. Online decision making in crowdsourcing markets: theoretical challenges, 2013, SECO.

[5] J. Marschak, et al. Economic Comparability of Information Systems, 1968.

[6] M. Spence. Job Market Signaling, 1973.

[7] George J. Stigler, et al. Information in the Labor Market, 1962, Journal of Political Economy.

[8] P. Whittle. Multi-Armed Bandits and the Gittins Index, 1980.

[9] Nicolò Cesa-Bianchi, et al. Gambling in a rigged casino: The adversarial multi-armed bandit problem, 1995, Proceedings of IEEE 36th Annual Foundations of Computer Science.

[10] George A. Akerlof. The Market for “Lemons”: Quality Uncertainty and the Market Mechanism, 1970.

[11] Michael N. Katehakis, et al. The Multi-Armed Bandit Problem: Decomposition and Computation, 1987, Math. Oper. Res..

[12] Aleksandrs Slivkins, et al. Adaptive Contract Design for Crowdsourcing Markets: Bandit Algorithms for Repeated Principal-Agent Problems, 2016, J. Artif. Intell. Res..

[13] Peter Auer, et al. The Nonstochastic Multiarmed Bandit Problem, 2002, SIAM J. Comput..

[14] Ehud Lehrer, et al. Signaling and Mediation in Games with Common Interests, 2006, Games Econ. Behav..

[15] Yishay Mansour, et al. Bayesian Incentive-Compatible Bandit Exploration, 2018.

[16] Sudipto Guha, et al. Approximation algorithms for budgeted learning problems, 2007, STOC '07.

[17] Jon M. Kleinberg, et al. Incentivizing exploration, 2014, EC.

[18] J. Bather, et al. Multi-Armed Bandit Allocation Indices, 1990.

[19] Eli Upfal, et al. Multi-Armed Bandits in Metric Spaces, 2008.

[20] H. Robbins. Some aspects of the sequential design of experiments, 1952.

[21] T. L. Lai and Herbert Robbins. Asymptotically Efficient Adaptive Allocation Rules, 1985.

[22] Ashish Goel, et al. The ratio index for budgeted learning, with applications, 2008, SODA.

[23] D. Bergemann, et al. Learning and Strategic Pricing, 1996.

[24] Marco Scarsini, et al. Positive value of information in games, 2003, Int. J. Game Theory.

[25] Stephen P. Boyd, et al. Convex Optimization, 2004, Algorithms and Theory of Computation Handbook.

[26] Sudipto Guha, et al. Multi-armed Bandits with Metric Switching Costs, 2009, ICALP.

[27] J. Hirshleifer. The Private and Social Value of Information and the Reward to Inventive Activity, 1971.

[28] Yishay Mansour, et al. Implementing the “Wisdom of the Crowd”, 2013, Journal of Political Economy.

[29] Aleksandrs Slivkins, et al. Bandits with Knapsacks, 2013, IEEE 54th Annual Symposium on Foundations of Computer Science.

[30] Kevin D. Glazebrook, et al. Multi-Armed Bandit Allocation Indices, 2011.

[31] Andreas Krause, et al. Truthful incentives in crowdsourcing tasks using regret minimization mechanisms, 2013, WWW.