Incentivizing Exploration with Selective Data Disclosure

We study the design of rating systems that incentivize (more) efficient social learning among self-interested agents. Agents arrive sequentially and are presented with a set of possible actions, each of which yields a positive reward with an unknown probability. A disclosure policy sends messages about the rewards of previously chosen actions to arriving agents. These messages can shift agents' incentives towards exploration: taking potentially sub-optimal actions for the sake of learning more about their rewards.

Prior work achieves much progress with disclosure policies that merely recommend an action to each user, without any other supporting information, and sometimes recommend exploratory actions. All of this work relies heavily on standard, yet very strong, rationality assumptions. These assumptions are quite problematic in the context of the motivating applications: recommendation systems such as Yelp, Amazon, or Netflix, and matching markets such as Airbnb. It is unclear whether users would know and understand a complicated disclosure policy announced by the principal, let alone trust the principal to faithfully implement it. (The principal may deviate from the announced policy intentionally, due to insufficient information about the users, or because of bugs in implementation.) Even if users understand the policy and trust that it was implemented as claimed, they might not react to it rationally, particularly given the lack of supporting information and the possibility of being singled out for exploration. For example, users may find such disclosure policies unacceptable and leave the system.

We study a class of disclosure policies whose messages, called unbiased subhistories, consist of the actions and rewards of a subsequence of past agents. Each subsequence is chosen ahead of time, according to a predetermined partial order on the rounds. We posit a flexible model of frequentist agent response, which we argue is plausible for this class of "order-based" disclosure policies.

We measure the performance of a policy by its regret, i.e., the difference in expected total reward between the best action and the policy. A disclosure policy that reveals the full history in each round risks inducing herding behavior among the agents, and typically incurs regret linear in the time horizon T. Our main result is an order-based disclosure policy that obtains regret Õ(√T). This regret rate is known to be optimal in the worst case over reward distributions, even absent incentives. We also exhibit simpler order-based policies with higher, but still sublinear, regret. These policies can be interpreted as dividing a sublinear number of agents into constant-sized focus groups, whose histories are then revealed to future agents.

Helping market participants find what they are looking for, and coordinating their search and exploration behavior in a globally optimal way, is an essential part of market design. This paper continues the line of work on "incentivized exploration": essentially, exploration-exploitation learning in the presence of self-interested users whose incentives are skewed in favor of exploitation. Conceptually, we study the interplay of information design, social learning, and multi-armed bandit algorithms.
To the best of our knowledge, this is the first paper in the literature on incentivized exploration (and possibly in the broader literature on "learning and incentives") that attempts to mitigate the limitations of standard economic assumptions. Full version: https://arxiv.org/abs/1811.06026.
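To make the contrast between the two disclosure regimes concrete, the following minimal Python sketch simulates full-history disclosure against a toy focus-group policy. It is an illustration, not the paper's algorithm or analysis: the arm means, the group size, the number of groups, and the try-unseen-actions-first tie-breaking rule are all assumptions made for the example. Agents respond frequentistically, playing the action with the best empirical mean in whatever subhistory is disclosed to them.

```python
import random

# All numbers below are illustrative assumptions, not values from the paper.
MEANS = [0.5, 0.6]   # Bernoulli reward probability of each action
T = 10_000           # time horizon (number of agents)
GROUP_SIZE = 5       # constant size of each focus group
NUM_GROUPS = 200     # number of focus groups, sublinear in T

def pull(arm):
    """Sample a Bernoulli reward for the chosen action."""
    return 1 if random.random() < MEANS[arm] else 0

def frequentist_choice(history):
    """Frequentist agent response: play the action with the highest
    empirical mean in the disclosed subhistory; try unseen actions first."""
    best_arm, best_mean = 0, float("-inf")
    for arm in range(len(MEANS)):
        obs = [r for a, r in history if a == arm]
        mean = sum(obs) / len(obs) if obs else float("inf")
        if mean > best_mean:
            best_arm, best_mean = arm, mean
    return best_arm

def run(policy):
    random.seed(0)
    history, total_reward = [], 0
    for t in range(T):
        if policy == "focus_groups" and t < NUM_GROUPS * GROUP_SIZE:
            # Early agents are split into focus groups; each agent sees only
            # its own group's subhistory, so groups explore independently.
            start = (t // GROUP_SIZE) * GROUP_SIZE
            disclosed = history[start:t]
        else:
            # The full-disclosure baseline (and all agents after the focus
            # groups) see the entire revealed history, and may herd on an
            # early empirical leader.
            disclosed = history
        arm = frequentist_choice(disclosed)
        reward = pull(arm)
        history.append((arm, reward))
        total_reward += reward
    return max(MEANS) * T - total_reward  # realized regret vs. best action

for policy in ("full_history", "focus_groups"):
    print(policy, "regret:", round(run(policy), 1))
```

Under full disclosure, every agent conditions on the same growing history, so an unlucky early sample can lock all subsequent agents onto the inferior action. The focus-group variant keeps early explorations independent before pooling them for later agents, which is the intuition behind the simpler sublinear-regret policies described in the abstract.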
