Policy teaching through reward function learning

Policy teaching considers a Markov Decision Process setting in which an interested party aims to influence an agent's decisions by providing limited incentives. In this paper, we consider the specific objective of inducing a pre-specified desired policy. We examine both the case in which the agent's reward function is known to the interested party and the case in which it is unknown, presenting a linear program for the former and formulating an active, indirect elicitation method for the latter. We provide conditions for logarithmic convergence, and present a polynomial time algorithm that ensures logarithmic convergence with arbitrarily high probability. We also offer practical elicitation heuristics that can be formulated as linear programs, and demonstrate their effectiveness on a policy teaching problem in a simulated ad-network setting. We extend our methods to handle partial observations and partial target policies, and provide a game-theoretic interpretation of our methods for handling strategic agents.
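As a rough illustration of the known-reward case, the sketch below checks whether a given incentive induces a target policy in a toy two-state MDP. It is not the paper's linear program: it only tests the condition such a program would need to enforce, namely that the target policy is optimal for the agent under reward plus incentive. The MDP, the state-based incentive, and all names (`REWARD`, `next_state`, `optimal_policy`, `induces`) are assumptions made for this example.

```python
# Toy setting (hypothetical, not from the paper): two states, two actions,
# deterministic transitions, and a reward that depends only on the state.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = ["stay", "move"]
REWARD = [1.0, 0.0]  # the agent's own reward; state 0 is preferred


def next_state(s, a):
    """'stay' keeps the current state, 'move' switches to the other one."""
    return s if a == "stay" else 1 - s


def optimal_policy(reward, gamma=GAMMA, sweeps=500):
    """Run value iteration, then return the greedy policy as a dict."""
    v = [0.0 for _ in STATES]
    for _ in range(sweeps):
        v = [reward[s] + gamma * max(v[next_state(s, a)] for a in ACTIONS)
             for s in STATES]
    # The state reward is the same for every action, so the greedy action
    # is the one leading to the successor state with the highest value.
    return {s: max(ACTIONS, key=lambda a: v[next_state(s, a)])
            for s in STATES}


def induces(target, incentive):
    """True if the agent's optimal policy under reward + incentive is target."""
    modified = [r + d for r, d in zip(REWARD, incentive)]
    return optimal_policy(modified) == target


# Desired policy: go to state 1 and remain there.
TARGET = {0: "move", 1: "stay"}
```

In this toy problem, `induces(TARGET, [0.0, 2.0])` holds because the incentive makes state 1 more rewarding than state 0, while `induces(TARGET, [0.0, 0.5])` does not. A linear program for this setting would search over incentives for a smallest one that passes such a check, rather than testing candidates one by one.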

David C. Parkes | Haoqi Zhang | Yiling Chen

[1] Craig Boutilier, et al. Regret-based Utility Elicitation in Constraint-based Decision Problems, 2005, IJCAI.

[2] Santosh S. Vempala, et al. Solving convex programs by random walks, 2004, JACM.

[3] Sven Rady, et al. Optimal Experimentation in a Changing Environment, 1997.

[4] Andrew Y. Ng, et al. Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping, 1999, ICML.

[5] Craig Boutilier, et al. Incremental utility elicitation with minimax regret decision criterion, 2003, IJCAI 2003.

[6] Eyal Amir, et al. Bayesian Inverse Reinforcement Learning, 2007, IJCAI.

[7] Craig Boutilier, et al. New Approaches to Optimization and Utility Elicitation in Autonomic Computing, 2005, AAAI.

[8] Mia Stern, et al. Applications of AI in education, 1996, CROS.

[9] Moshe Tennenholtz, et al. k-Implementation, 2003, EC '03.

[10] T. Mulgan. The Contract Theory, 2006.

[11] C. Boutilier, et al. Accelerating Reinforcement Learning through Implicit Imitation, 2003, J. Artif. Intell. Res.

[12] Krzysztof Z. Gajos, et al. Preference elicitation for interface optimization, 2005, UIST.

[13] Jesse Hoey, et al. A planning system based on Markov decision processes to guide people with dementia through activities of daily living, 2006, IEEE Transactions on Information Technology in Biomedicine.

[14] Noam Nisan, et al. Proceedings of the 4th ACM conference on Electronic commerce, 2003.

[15] Tuomas Sandholm, et al. Preference elicitation in combinatorial auctions, 2001, AAMAS '02.

[16] David C. Parkes, et al. A General Approach to Environment Design with One Agent, 2009, IJCAI.

[17] Robert J. Vanderbei, et al. Linear Programming: Foundations and Extensions, 1998, Kluwer International Series in Operations Research and Management Science.

[18] Andrew Y. Ng, et al. Algorithms for Inverse Reinforcement Learning, 2000, ICML.

[19] Pieter Abbeel, et al. Apprenticeship learning via inverse reinforcement learning, 2004, ICML.

[20] H. Varian. Revealed Preference, 2006.

[21] Martin L. Puterman, et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[22] Moshe Babaioff, et al. Mixed Strategies in Combinatorial Agency, 2006, WINE.

[23] Moshe Babaioff, et al. Algorithmic Game Theory: Incentives in Peer-to-Peer Systems, 2007.

[24] Craig Boutilier, et al. A Bayesian Approach to Imitation in Reinforcement Learning, 2003, IJCAI.

[25] Moshe Babaioff, et al. Combinatorial agency, 2006, EC '06.

[26] B. Grünbaum. Partitions of mass-distributions and of convex bodies by hyperplanes, 1960.

[27] Scott Shenker, et al. Hidden-action in multi-hop routing, 2005, EC '05.

[28] Krzysztof Z. Gajos, et al. Automatically generating custom user interfaces for users with physical disabilities, 2006, Assets '06.

[29] Daphne Koller, et al. Learning an Agent's Utility Function by Observing Behavior, 2001, ICML.

[30] D. Bergemann, et al. Learning and Strategic Pricing, 1996.

[31] Craig Boutilier, et al. Eliciting Bid Taker Non-price Preferences in (Combinatorial) Auctions, 2004, AAAI.

[32] Luis Rademacher, et al. Approximating the centroid is hard, 2007, SCG '07.

[33] Craig Boutilier, et al. Constraint-Based Optimization with the Minimax Decision Criterion, 2003, CP.

[34] Rajesh P. N. Rao, et al. A Probabilistic Framework for Model-Based Imitation Learning, 2004.

[35] David C. Parkes, et al. Value-Based Policy Teaching with Active Indirect Elicitation, 2008, AAAI.

[36] Craig Boutilier, et al. A POMDP formulation of preference elicitation problems, 2002, AAAI/IAAI.

[37] Hao Zhang, et al. A Dynamic Principal-Agent Model with Hidden Information: Sequential Optimality Through Truthful State Revelation, 2008, Oper. Res.

[38] Nicole Immorlica, et al. Game-Theoretic Aspects of Designing Hyperlink Structures, 2006, WINE.

[39] Daphne Koller, et al. Making Rational Decisions Using Adaptive Utility Elicitation, 2000, AAAI/IAAI.