Comparing human-centric and robot-centric sampling for robot deep learning from demonstrations

Motivated by recent advances in deep learning for robot control, this paper considers two learning algorithms in terms of how they acquire demonstrations from fallible human supervisors. Human-Centric (HC) sampling is the standard supervised learning approach, in which a human supervisor demonstrates the task by teleoperating the robot, providing trajectories of state-control pairs. Robot-Centric (RC) sampling is an increasingly popular alternative used in algorithms such as DAgger, in which a human supervisor observes the robot execute a learned policy and provides corrective control labels for each state visited. We suggest that RC sampling can be challenging for human supervisors and prone to mislabeling. RC sampling can also induce error in policy performance because it repeatedly visits areas of the state space that are harder to learn. Although policies learned with RC sampling can be superior to those learned with HC sampling for standard learning models such as linear SVMs, policies learned with HC sampling may be comparable to RC when applied to expressive learning models such as deep neural networks and hyper-parametric decision trees, which can achieve very low training error given enough data. We compare HC and RC on a grid-world environment and a physical robot singulation task, in which the input is a binary image of objects on a planar work surface and the policy outputs a gripper motion that separates one object from the rest. In simulation we observe that for linear SVMs, policies learned with RC outperform those learned with HC, but with deep models this advantage disappears. We also find that with RC, the corrective control labels provided by humans can be highly inconsistent. We prove that there exists a class of examples in which, in the limit, HC is guaranteed to converge to an optimal policy while RC may fail to converge. These results suggest that a form of HC sampling may be preferable for highly expressive learning models and human supervisors.
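To make the two sampling schemes concrete, below is a minimal Python sketch of the HC and RC data-collection loops described above. The helpers (`human_demonstrate`, `rollout`, `human_label`) and the `Policy` interface are hypothetical placeholders used for illustration; this shows the general pattern, not the authors' implementation.

```python
# Minimal sketch of Human-Centric (HC) vs. Robot-Centric (RC) sampling.
# All helpers and the Policy.fit interface are assumed placeholders,
# not the paper's actual code.

def hc_sampling(policy, human_demonstrate, num_demos):
    """HC: the supervisor teleoperates the robot, so every training state
    is drawn from the human's own trajectory distribution."""
    data = []
    for _ in range(num_demos):
        # One full demonstration: a list of (state, control) pairs
        # generated entirely by the human supervisor.
        data.extend(human_demonstrate())
    policy.fit(data)  # ordinary supervised learning on human-visited states
    return policy


def rc_sampling(policy, rollout, human_label, num_iters):
    """RC (as in DAgger): the robot executes its current policy, and the
    supervisor retroactively labels the states the robot visits."""
    data = []
    for _ in range(num_iters):
        states = rollout(policy)  # states drawn from the robot's own policy
        # The human supplies a corrective control for each visited state;
        # the paper observes these labels can be highly inconsistent.
        data.extend((s, human_label(s)) for s in states)
        policy.fit(data)  # retrain on the aggregated dataset
    return policy
```

The essential difference is the distribution from which training states are drawn: the human's trajectory distribution in HC versus the robot's evolving policy distribution in RC, which is what drives the convergence contrast analyzed in the paper.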
