Preference elicitation and inverse reinforcement learning

We state the problem of inverse reinforcement learning in terms of preference elicitation, resulting in a principled (Bayesian) statistical formulation. This generalises previous work on Bayesian inverse reinforcement learning and allows us to obtain a posterior distribution over the agent's preferences, its policy and, optionally, the obtained reward sequence, given the observations. We examine how the resulting approach relates to other statistical methods for inverse reinforcement learning, both analytically and experimentally. We show that the preferences can be determined accurately even when the observed agent's policy is sub-optimal with respect to its own preferences. In that case, the policies we infer significantly outperform both the demonstrated policy and the policies produced by other methods, as measured against the agent's own preferences.
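To make the posterior-over-preferences idea concrete, the following is a minimal sketch of Bayesian inverse reinforcement learning on a toy MDP. It is not the paper's exact model: the softmax (Boltzmann) demonstration likelihood, the Gaussian prior on rewards, the random-walk Metropolis sampler, and all names and parameter values here are assumptions chosen for illustration.

```python
# Sketch: Bayesian IRL by Metropolis-Hastings sampling over reward
# parameters.  Assumptions (not from the paper): state-only rewards,
# a softmax likelihood for a possibly sub-optimal demonstrator, and a
# standard normal prior.
import numpy as np

rng = np.random.default_rng(0)

# A tiny random MDP with known dynamics P[s, a] = next-state distribution.
n_states, n_actions, gamma = 5, 2, 0.95
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def q_values(reward, n_iters=200):
    """Q-iteration for a state-only reward vector."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        V = Q.max(axis=1)
        Q = reward[:, None] + gamma * (P @ V)
    return Q

def log_likelihood(reward, demos, beta=5.0):
    """Softmax (Boltzmann) log-likelihood of observed (state, action)
    pairs; beta controls how near-optimal the demonstrator is assumed
    to be."""
    Q = beta * q_values(reward)
    logp = Q - np.logaddexp.reduce(Q, axis=1, keepdims=True)
    return sum(logp[s, a] for s, a in demos)

def sample_posterior(demos, n_samples=2000, step=0.1):
    """Random-walk Metropolis over rewards with a N(0, I) prior."""
    r = np.zeros(n_states)
    lp = log_likelihood(r, demos) - 0.5 * r @ r
    samples = []
    for _ in range(n_samples):
        r_new = r + step * rng.standard_normal(n_states)
        lp_new = log_likelihood(r_new, demos) - 0.5 * r_new @ r_new
        if np.log(rng.random()) < lp_new - lp:   # accept/reject
            r, lp = r_new, lp_new
        samples.append(r.copy())
    return np.array(samples)

# Demonstrations from a noisy agent whose true reward favours state 0:
# with probability 0.2 it acts at random, otherwise greedily.
true_reward = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
Q_true = q_values(true_reward)
demos, s = [], 0
for _ in range(100):
    a = rng.choice(n_actions) if rng.random() < 0.2 else int(Q_true[s].argmax())
    demos.append((s, a))
    s = rng.choice(n_states, p=P[s, a])

posterior = sample_posterior(demos)
print("posterior mean reward:", posterior[len(posterior) // 2:].mean(axis=0))
```

A policy that is greedy with respect to the posterior mean (or that integrates over posterior samples) can then outperform the noisy demonstration itself, which is the effect described in the abstract.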
