Exploration and apprenticeship learning in reinforcement learning

We consider reinforcement learning in systems with unknown dynamics. Algorithms such as E3 (Kearns and Singh, 2002) learn near-optimal policies by using "exploration policies" to drive the system towards poorly modeled states, so as to encourage exploration. But this makes these algorithms impractical for many systems; for example, on an autonomous helicopter, overly aggressive exploration may well result in a crash. In this paper, we consider the apprenticeship learning setting in which a teacher demonstration of the task is available. We show that, given the initial demonstration, no explicit exploration is necessary, and we can attain near-optimal performance (compared to the teacher) simply by repeatedly executing "exploitation policies" that try to maximize rewards. In finite-state MDPs, our algorithm scales polynomially in the number of states; in continuous-state linear dynamical systems, it scales polynomially in the dimension of the state. These results are proved using a martingale construction over relative losses.
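The main loop described in the abstract is simple enough to sketch. Below is a minimal illustrative sketch for the finite-state MDP case, not the paper's implementation: it seeds a maximum-likelihood transition model with the teacher's demonstration, then repeatedly plans in the current model (via value iteration) and executes the resulting greedy "exploitation" policy, folding the new experience back into the model. The `env` interface (`n_states`, `n_actions`, `reset`, `step`) and the trajectory format are assumptions made for illustration.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, iters=500):
    """Greedy optimal policy for the estimated MDP (P: SxAxS, R: SxA)."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * (P @ V)   # Q[s,a] = R[s,a] + gamma * sum_s' P[s,a,s'] V[s']
        V = Q.max(axis=1)
    return Q.argmax(axis=1)       # policy[s] = argmax_a Q[s,a]

def estimate_model(counts, reward_sums, smoothing=1e-3):
    """Maximum-likelihood transition model from visit counts (smoothed)."""
    P = counts + smoothing
    P /= P.sum(axis=2, keepdims=True)
    n = counts.sum(axis=2)
    R = reward_sums / np.maximum(n, 1.0)
    return P, R

def apprenticeship_rl(env, teacher_trajectories, n_iters=50, horizon=200):
    """No-exploration loop: fit a model to all data seen so far and
    repeatedly execute the greedy exploitation policy."""
    S, A = env.n_states, env.n_actions       # assumed env interface
    counts = np.zeros((S, A, S))
    reward_sums = np.zeros((S, A))
    # Seed the model with the teacher's demonstration.
    for traj in teacher_trajectories:         # traj: list of (s, a, r, s') tuples
        for (s, a, r, s2) in traj:
            counts[s, a, s2] += 1
            reward_sums[s, a] += r
    for _ in range(n_iters):
        P, R = estimate_model(counts, reward_sums)
        policy = value_iteration(P, R)
        s = env.reset()
        for _ in range(horizon):
            a = policy[s]
            s2, r, done = env.step(a)
            counts[s, a, s2] += 1             # executing still refines the model
            reward_sums[s, a] += r
            s = s2
            if done:
                break
    return policy
```

The sketch mirrors the abstract's claim: no exploration bonus or exploration policy appears anywhere; the data gathered by executing exploitation policies, on top of the teacher's demonstration, is what drives the model toward accuracy along the states that matter.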

References

[1] P. Billingsley. Probability and Measure. Wiley, 1980.

[2] J. A. Adams. Learning of movement sequences. Psychological Bulletin, 1984.

[3] D. Pomerleau. ALVINN: An Autonomous Land Vehicle in a Neural Network. NIPS, 1989.

[4] B. D. O. Anderson and J. B. Moore. Optimal Control: Linear Quadratic Methods. Prentice-Hall, 1990.

[5] R. Durrett. Probability: Theory and Examples. 1993.

[6] D. Williams. Probability with Martingales. Cambridge University Press, 1991.

[7] Y. Kuniyoshi, M. Inaba, and H. Inoue. Learning by watching: extracting reusable task knowledge from visual observation of human performance. IEEE Transactions on Robotics and Automation, 1994.

[8] S. Schaal and C. G. Atkeson. Robot learning by nonparametric regression. IROS, 1994.

[9] G. M. Hayes and J. Demiris. A robot controller using learning by imitation. 1994.

[10] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 1996.

[11] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

[12] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[13] S. King. Learning to fly. Nursing Times, 1998.

[14] M. Kearns and D. Koller. Efficient Reinforcement Learning in Factored MDPs. IJCAI, 1999.

[15] W. D. Smart and L. P. Kaelbling. Practical Reinforcement Learning in Continuous Spaces. ICML, 2000.

[16] R. Amit and M. J. Matarić. Learning movement sequences from demonstration. Proceedings of the 2nd International Conference on Development and Learning (ICDL), 2002.

[17] R. I. Brafman and M. Tennenholtz. R-MAX: A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning. Journal of Machine Learning Research, 2002.

[18] S. Kakade, M. Kearns, and J. Langford. Exploration in Metric State Spaces. ICML, 2003.

[19] A. Y. Ng, A. Coates, M. Diel, V. Ganapathi, J. Schulte, B. Tse, E. Berger, and E. Liang. Autonomous Inverted Helicopter Flight via Reinforcement Learning. ISER, 2004.

[20] S. M. Kakade and A. Y. Ng. Online Bounds for Bayesian Algorithms. NIPS, 2004.

[21] M. Kearns and S. Singh. Near-Optimal Reinforcement Learning in Polynomial Time. Machine Learning, 2002.

[22] P. Abbeel and A. Y. Ng. Apprenticeship Learning via Inverse Reinforcement Learning. ICML, 2004.

[23] R. S. Sutton (Ed.). Reinforcement Learning. 1992.