Policy teaching through reward function learning

Policy teaching considers a Markov Decision Process setting in which an interested party aims to influence an agent's decisions by providing limited incentives. In this paper, we consider the specific objective of inducing a pre-specified desired policy. We examine both the case in which the agent's reward function is known to the interested party and the case in which it is unknown, presenting a linear program for the former and formulating an active, indirect elicitation method for the latter. We provide conditions for logarithmic convergence and present a polynomial-time algorithm that ensures logarithmic convergence with arbitrarily high probability. We also offer practical elicitation heuristics that can be formulated as linear programs, and demonstrate their effectiveness on a policy teaching problem in a simulated ad-network setting. Finally, we extend our methods to handle partial observations and partial target policies, and provide a game-theoretic interpretation of our methods for handling strategic agents.
