Learning from Demonstration

By now it is widely accepted that learning a task from scratch, i.e., without any prior knowledge, is a daunting undertaking. Humans, however, rarely attempt to learn from scratch: they extract initial biases, as well as strategies for how to approach a learning problem, from instructions and/or demonstrations given by other humans. For learning control, this paper investigates how learning from demonstration can be applied in the context of reinforcement learning. We consider priming the Q-function, the value function, the policy, and the model of the task dynamics as possible areas where demonstrations can speed up learning. In general nonlinear learning problems, only model-based reinforcement learning shows significant speed-up after a demonstration, while in the special case of linear quadratic regulator (LQR) problems, all methods profit from the demonstration. In an implementation of pole balancing on a complex anthropomorphic robot arm, we demonstrate that, when facing the complexities of real signal processing, model-based reinforcement learning offers the most robustness for LQR problems. Using the suggested methods, the robot learns pole balancing in a single trial after a 30-second demonstration by the human instructor.
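To make the model-based route for the LQR case concrete, the following is a minimal sketch, not the paper's actual implementation: the demonstration data, state/action dimensions, and cost weights below are illustrative placeholders. The idea is to fit linear dynamics to the demonstrated trajectory by least squares and then compute an LQR controller for the fitted model, which can serve as the policy for the robot's first trial.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Placeholder "demonstration": states x_t and actions u_t sampled from a
# synthetic linear system standing in for the recorded human demonstration.
rng = np.random.default_rng(0)
A_true = np.array([[1.0, 0.02, 0.0, 0.0],
                   [0.0, 1.0, 0.30, 0.0],
                   [0.0, 0.0, 1.0, 0.02],
                   [0.0, 0.0, 0.70, 1.0]])
B_true = np.array([[0.0], [0.2], [0.0], [0.3]])
X = rng.standard_normal((300, 4))                      # states x_t
U = rng.standard_normal((300, 1))                      # actions u_t
X_next = X @ A_true.T + U @ B_true.T + 0.01 * rng.standard_normal((300, 4))

# 1) Prime a model: least-squares fit of x_{t+1} ~ A x_t + B u_t
#    using only the demonstration data.
Z = np.hstack([X, U])                                  # regressors [x_t, u_t]
Theta, *_ = np.linalg.lstsq(Z, X_next, rcond=None)
A_hat, B_hat = Theta[:4].T, Theta[4:].T                # estimated A (4x4), B (4x1)

# 2) Solve the LQR problem for the fitted model (cost x'Qx + u'Ru).
Q = np.eye(4)
R = 0.1 * np.eye(1)
P = solve_discrete_are(A_hat, B_hat, Q, R)             # discrete Riccati equation
K = np.linalg.solve(R + B_hat.T @ P @ B_hat, B_hat.T @ P @ A_hat)

# 3) Policy for the robot's first own trial: u = -K x.
#    The model (and hence K) can be refined online from the robot's own data.
print("LQR gain K:", K)
```

In this sketch the demonstration enters only through the fitted model; priming the Q-function, value function, or policy directly would instead initialize those quantities from the same trajectory data.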
