Incentive design for adaptive agents

We consider a setting in which a principal seeks to induce an adaptive agent to select a target action by providing incentives on one or more actions. The agent maintains a belief about the value of each action, which may update based on experience, and at each time step selects the action with the maximal sum of believed value and associated incentive. The principal observes the agent's selections but has no information about the agent's current beliefs or belief update process. To induce the target action as soon as possible, or as often as possible over a fixed time period, it is optimal for a principal with a per-period budget to assign the entire budget to the target action and wait for the agent to come to prefer that choice. With an across-period budget, however, no algorithm can provide good performance on all instances without knowledge of the agent's update process, except in the particular case in which the goal is to induce the agent to select the target action only once. We demonstrate ways to overcome this strong negative result given knowledge about the agent's beliefs, providing a tractable algorithm for the offline problem in which the principal has perfect knowledge, and an analytical solution for an instance of the problem in which partial knowledge is available.
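
The agent's selection rule and the per-period-budget strategy described above can be illustrated with a minimal sketch in Python. All names and the belief-update rule below are hypothetical and chosen only for illustration; the paper specifies no implementation. The agent greedily picks the action maximizing its current value belief plus any offered incentive, while the per-period-budget principal simply places its full budget on the target action each step and waits.

```python
# Illustrative sketch only; the agent's true belief-update process is unknown
# to the principal, and the drift used here is a stand-in for that process.
import random


def agent_choice(values, incentives):
    """Return the index of the action with the maximal value + incentive."""
    totals = [v + i for v, i in zip(values, incentives)]
    return max(range(len(totals)), key=totals.__getitem__)


def simulate(num_actions=3, target=0, budget=1.0, horizon=20, seed=0):
    rng = random.Random(seed)
    # The agent's private value beliefs, unobserved by the principal.
    values = [rng.random() for _ in range(num_actions)]
    picks = []
    for _ in range(horizon):
        # Per-period budget policy: assign the entire budget to the target action.
        incentives = [budget if a == target else 0.0 for a in range(num_actions)]
        choice = agent_choice(values, incentives)
        picks.append(choice)
        # Hypothetical belief update from experience on the chosen action.
        values[choice] += rng.uniform(-0.1, 0.1)
    return picks


if __name__ == "__main__":
    print(simulate())
```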
