End-to-End Training of Deep Visuomotor Policies

Policy search methods allow robots to learn control policies for a wide range of tasks, but practical applications of policy search often require hand-engineered components for perception, state estimation, and low-level control. In this paper, we aim to answer the following question: does training the perception and control systems jointly end-to-end provide better performance than training each component separately? To this end, we develop a method that can be used to learn policies that map raw image observations directly to torques at the robot's motors. The policies are represented by deep convolutional neural networks (CNNs) with 92,000 parameters, and are trained using a partially observed guided policy search method, which transforms policy search into supervised learning, with supervision provided by a simple trajectory-centric reinforcement learning method. We evaluate our method on a range of real-world manipulation tasks that require close coordination between vision and control, such as screwing a cap onto a bottle, and present simulated comparisons to several prior policy search methods.
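To make the training scheme concrete, the sketch below illustrates the supervised half of guided policy search: a small CNN maps a raw camera image plus the robot's joint state to motor torques and is regressed onto actions supplied by a trajectory-centric teacher controller. This is not the authors' implementation; the framework (PyTorch), all layer sizes, the 14-dimensional state, the 7 torque outputs, and the plain MSE loss are illustrative assumptions standing in for the paper's larger architecture and KL-weighted objective.

```python
# Minimal sketch of the supervised step in guided policy search.
# Assumptions, not the paper's method: PyTorch, layer sizes, state/action
# dimensions, and a plain MSE loss in place of the KL-weighted objective.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisuomotorPolicy(nn.Module):
    """CNN mapping a raw RGB image + joint state to motor torques."""

    def __init__(self, state_dim: int = 14, action_dim: int = 7):
        super().__init__()
        # Convolutional trunk: raw pixels -> compact visual features.
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=5, stride=2), nn.ReLU(),
            # Generic pooling as a stand-in for the paper's spatial
            # feature-point layer.
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),
        )
        # Fully connected head: visual features + joint state -> torques.
        self.head = nn.Sequential(
            nn.Linear(16 * 4 * 4 + state_dim, 40), nn.ReLU(),
            nn.Linear(40, 40), nn.ReLU(),
            nn.Linear(40, action_dim),
        )

    def forward(self, image: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        features = self.conv(image)
        return self.head(torch.cat([features, state], dim=-1))


def supervised_step(policy, optimizer, images, states, teacher_actions):
    """One update fitting the policy to the teacher controller's actions."""
    loss = F.mse_loss(policy(images, states), teacher_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    # Toy usage with random data in place of robot camera/state logs.
    policy = VisuomotorPolicy()
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
    images = torch.randn(32, 3, 64, 64)        # (batch, channels, H, W)
    states = torch.randn(32, 14)                # joint angles/velocities
    teacher_actions = torch.randn(32, 7)        # torques from the teacher
    print(supervised_step(policy, optimizer, images, states, teacher_actions))
```

In the full method, an outer loop would alternate this supervised fit with re-optimization of the trajectory-centric controllers, keeping the teacher's actions consistent with what the vision-based policy can actually reproduce.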
