Bayesian Incentive-Compatible Bandit Exploration

Individual decision makers consume information revealed by previous decision makers and produce information that may help future decision makers. This phenomenon is common in a wide range of scenarios in the Internet economy, as well as elsewhere, such as in medical decisions. When required to select an action, each decision maker would individually prefer to exploit: to select the action with the highest expected reward conditional on her information. At the same time, each decision maker would prefer previous decision makers to have explored, producing information about the rewards of the various actions. A social planner, by means of carefully designed information disclosure, can incentivize the agents to balance exploration and exploitation, maximizing social welfare. We formulate this problem as a multi-armed bandit problem (and various generalizations thereof) under incentive-compatibility constraints induced by the agents' Bayesian priors. We design an incentive-compatible bandit algorithm for the social planner with asymptotically optimal regret. Further, we provide a black-box reduction from an arbitrary multi-armed bandit algorithm to an incentive-compatible one, with only a constant multiplicative increase in regret. This reduction works for very general bandit settings, even ones that incorporate contexts and arbitrary partial feedback.
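The key lever described above is information disclosure: the planner reveals only a recommendation, so an agent cannot tell whether her round is an exploration round or an exploitation round, and following the recommendation remains a best response when exploration is rare enough. The following is a minimal simulation sketch of this hidden-exploration idea; the Bernoulli arms, the Beta-Bernoulli posterior, and the exploration rate are illustrative assumptions, not the paper's actual construction or its incentive-compatibility analysis.

```python
import random

def posterior_means(successes, pulls, prior_mean=0.5, prior_strength=2):
    """Beta-Bernoulli posterior mean per arm (prior parameters are assumed)."""
    return [(prior_mean * prior_strength + s) / (prior_strength + n)
            for s, n in zip(successes, pulls)]

def simulate(true_means, rounds=10000, explore_prob=0.05, seed=0):
    """Planner recommends an arm each round; agents see only the recommendation.

    With small probability the recommendation is a hidden exploration pull of
    the least-sampled arm; otherwise it is the exploit choice (highest
    posterior mean). Agents cannot distinguish the two cases.
    """
    rng = random.Random(seed)
    k = len(true_means)
    successes, pulls = [0] * k, [0] * k
    total_reward = 0.0
    for _ in range(rounds):
        if rng.random() < explore_prob:
            # Hidden exploration: recommend the least-pulled arm.
            arm = min(range(k), key=lambda a: pulls[a])
        else:
            # Exploitation: recommend the arm with highest posterior mean.
            means = posterior_means(successes, pulls)
            arm = max(range(k), key=lambda a: means[a])
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        pulls[arm] += 1
        total_reward += reward
    return total_reward / rounds, pulls

avg, pulls = simulate([0.3, 0.6])
```

In the paper's setting the exploration rate cannot be an arbitrary constant: it must be calibrated against the agents' Bayesian priors so that, conditional on receiving a recommendation, each agent's posterior still favors complying with it.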
