Apprenticeship learning via inverse reinforcement learning

We consider learning in a Markov decision process where we are not explicitly given a reward function, but where instead we can observe an expert demonstrating the task that we want to learn to perform. This setting is useful in applications (such as the task of driving) where it may be difficult to write down an explicit reward function specifying exactly how different desiderata should be traded off. We think of the expert as trying to maximize a reward function that is expressible as a linear combination of known features, and give an algorithm for learning the task demonstrated by the expert. Our algorithm is based on using "inverse reinforcement learning" to try to recover the unknown reward function. We show that our algorithm terminates in a small number of iterations, and that even though we may never recover the expert's reward function, the policy output by the algorithm will attain performance close to that of the expert, where performance is measured with respect to the expert's unknown reward function.
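To make the procedure described above more concrete, the following is a minimal, self-contained sketch of a projection-style apprenticeship-learning loop on a small tabular MDP, written in Python with NumPy. It is an illustration under stated assumptions rather than the authors' reference implementation: the random MDP, the helper names (`value_iteration`, `feature_expectations`, `apprenticeship_learning`), and all parameter choices are hypothetical, and the inner reinforcement-learning step is stood in for by exact value iteration.

```python
import numpy as np


def value_iteration(P, r, gamma, n_iters=500):
    """Greedy policy for a small tabular MDP.

    P: (A, S, S) transition probabilities, r: (S,) reward on states."""
    V = np.zeros(P.shape[1])
    for _ in range(n_iters):
        Q = r[None, :] + gamma * (P @ V)   # Q[a, s] = r(s) + gamma * E[V(s') | s, a]
        V = Q.max(axis=0)
    return Q.argmax(axis=0)                # one greedy action per state


def feature_expectations(P, policy, phi, d0, gamma):
    """mu(pi) = sum_t gamma^t E[phi(s_t)], computed in closed form via the
    discounted state occupancy d = (I - gamma * P_pi^T)^{-1} d0."""
    S = P.shape[1]
    P_pi = P[policy, np.arange(S), :]      # (S, S) transitions under the policy
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, d0)
    return phi.T @ d                       # (k,) feature expectations


def apprenticeship_learning(P, phi, d0, mu_E, gamma=0.9, eps=1e-6, max_iters=100):
    """Projection-style apprenticeship learning loop (illustrative sketch)."""
    S = P.shape[1]
    policy = np.zeros(S, dtype=int)        # arbitrary initial policy
    mu = feature_expectations(P, policy, phi, d0, gamma)
    mu_bar = mu.copy()                     # point inside the hull of the mu's found so far
    for i in range(max_iters):
        if i > 0:
            step = mu - mu_bar
            if step @ step < 1e-12:        # no further progress possible; stop
                break
            # Project mu_E onto the line through mu_bar and the latest mu.
            mu_bar = mu_bar + (step @ (mu_E - mu_bar)) / (step @ step) * step
        w = mu_E - mu_bar                  # candidate reward weights
        t = np.linalg.norm(w)              # remaining feature-expectation gap
        if t <= eps:
            break                          # some mixture of the found policies is eps-close
        # Inner RL step: act optimally for the candidate reward R(s) = w . phi(s).
        policy = value_iteration(P, phi @ w, gamma)
        mu = feature_expectations(P, policy, phi, d0, gamma)
    return policy, w, t


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, k, gamma = 20, 4, 5, 0.9
    P = rng.dirichlet(np.ones(S), size=(A, S))   # random transition model
    phi = rng.uniform(0.0, 1.0, size=(S, k))     # state features in [0, 1]^k
    d0 = np.full(S, 1.0 / S)                     # uniform start-state distribution
    w_true = rng.uniform(-1.0, 1.0, size=k)      # hidden "expert" reward weights
    expert = value_iteration(P, phi @ w_true, gamma)
    mu_E = feature_expectations(P, expert, phi, d0, gamma)
    policy, w, gap = apprenticeship_learning(P, phi, d0, mu_E, gamma)
    print("remaining feature-expectation gap:", gap)
```

In the full algorithm the returned answer can be a mixture over the intermediate policies (chosen by a small quadratic program), and the candidate weights may instead be found with a max-margin solver; the sketch above simply returns the last policy, which is enough to illustrate the main loop and the termination test on the feature-expectation gap t.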
