Exploration and apprenticeship learning in reinforcement learning

We consider reinforcement learning in systems with unknown dynamics. Algorithms such as E3 (Kearns and Singh, 2002) learn near-optimal policies by using "exploration policies" to drive the system towards poorly modeled states, so as to encourage exploration. But this makes these algorithms impractical for many systems; for example, on an autonomous helicopter, overly aggressive exploration may well result in a crash. In this paper, we consider the apprenticeship learning setting in which a teacher demonstration of the task is available. We show that, given the initial demonstration, no explicit exploration is necessary, and we can attain near-optimal performance (compared to the teacher) simply by repeatedly executing "exploitation policies" that try to maximize rewards. In finite-state MDPs, our algorithm scales polynomially in the number of states; in continuous-state linear dynamical systems, it scales polynomially in the dimension of the state. These results are proved using a martingale construction over relative losses.
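The main loop described in the abstract is simple enough to sketch. Below is a minimal illustrative sketch for the finite-state MDP case, not the paper's implementation: it seeds a maximum-likelihood transition model with the teacher's demonstration, then repeatedly plans in the current model (via value iteration) and executes the resulting greedy "exploitation" policy, folding the new experience back into the model. The `env` interface (`n_states`, `n_actions`, `reset`, `step`) and the trajectory format are assumptions made for illustration.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, iters=500):
    """Greedy optimal policy for the estimated MDP (P: SxAxS, R: SxA)."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * (P @ V)   # Q[s,a] = R[s,a] + gamma * sum_s' P[s,a,s'] V[s']
        V = Q.max(axis=1)
    return Q.argmax(axis=1)       # policy[s] = argmax_a Q[s,a]

def estimate_model(counts, reward_sums, smoothing=1e-3):
    """Maximum-likelihood transition model from visit counts (smoothed)."""
    P = counts + smoothing
    P /= P.sum(axis=2, keepdims=True)
    n = counts.sum(axis=2)
    R = reward_sums / np.maximum(n, 1.0)
    return P, R

def apprenticeship_rl(env, teacher_trajectories, n_iters=50, horizon=200):
    """No-exploration loop: fit a model to all data seen so far and
    repeatedly execute the greedy exploitation policy."""
    S, A = env.n_states, env.n_actions       # assumed env interface
    counts = np.zeros((S, A, S))
    reward_sums = np.zeros((S, A))
    # Seed the model with the teacher's demonstration.
    for traj in teacher_trajectories:         # traj: list of (s, a, r, s') tuples
        for (s, a, r, s2) in traj:
            counts[s, a, s2] += 1
            reward_sums[s, a] += r
    for _ in range(n_iters):
        P, R = estimate_model(counts, reward_sums)
        policy = value_iteration(P, R)
        s = env.reset()
        for _ in range(horizon):
            a = policy[s]
            s2, r, done = env.step(a)
            counts[s, a, s2] += 1             # executing still refines the model
            reward_sums[s, a] += r
            s = s2
            if done:
                break
    return policy
```

The sketch mirrors the abstract's claim: no exploration bonus or exploration policy appears anywhere; the data gathered by executing exploitation policies, on top of the teacher's demonstration, is what drives the model toward accuracy along the states that matter.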

References

[1] P. Billingsley. Probability and Measure. Wiley, 1980.

[2] J. A. Adams. Learning of movement sequences. Psychological Bulletin, 1984.

[3] D. Pomerleau. ALVINN: An Autonomous Land Vehicle in a Neural Network. NIPS, 1989.

[4] B. D. O. Anderson and J. B. Moore. Optimal Control: Linear Quadratic Methods. Prentice-Hall, 1990.

[5] R. Durrett. Probability: Theory and Examples. 1993.

[6] D. Williams. Probability with Martingales. Cambridge University Press, 1991.

[7] Y. Kuniyoshi, M. Inaba, and H. Inoue. Learning by watching: extracting reusable task knowledge from visual observation of human performance. IEEE Transactions on Robotics and Automation, 1994.

[8] S. Schaal and C. G. Atkeson. Robot learning by nonparametric regression. IROS, 1994.

[9] G. M. Hayes and J. Demiris. A robot controller using learning by imitation. 1994.

[10] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 1996.

[11] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

[12] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[13] S. King. Learning to fly. Nursing Times, 1998.

[14] M. Kearns and D. Koller. Efficient Reinforcement Learning in Factored MDPs. IJCAI, 1999.

[15] W. D. Smart and L. P. Kaelbling. Practical Reinforcement Learning in Continuous Spaces. ICML, 2000.

[16] R. Amit and M. J. Matarić. Learning movement sequences from demonstration. Proceedings of the 2nd International Conference on Development and Learning (ICDL), 2002.

[17] R. I. Brafman and M. Tennenholtz. R-MAX: A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning. Journal of Machine Learning Research, 2002.

[18] S. Kakade, M. Kearns, and J. Langford. Exploration in Metric State Spaces. ICML, 2003.

[19] A. Y. Ng, A. Coates, M. Diel, V. Ganapathi, J. Schulte, B. Tse, E. Berger, and E. Liang. Autonomous Inverted Helicopter Flight via Reinforcement Learning. ISER, 2004.

[20] S. M. Kakade and A. Y. Ng. Online Bounds for Bayesian Algorithms. NIPS, 2004.

[21] M. Kearns and S. Singh. Near-Optimal Reinforcement Learning in Polynomial Time. Machine Learning, 2002.

[22] P. Abbeel and A. Y. Ng. Apprenticeship Learning via Inverse Reinforcement Learning. ICML, 2004.

[23] R. S. Sutton (Ed.). Reinforcement Learning. 1992.