Learning contact-rich manipulation skills with guided policy search

Autonomous learning of object manipulation skills can enable robots to acquire rich behavioral repertoires that scale to the variety of objects found in the real world. However, current motion skill learning methods typically restrict the behavior to a compact, low-dimensional representation, limiting its expressiveness and generality. In this paper, we extend a recently developed policy search method [31] and use it to learn a range of dynamic manipulation behaviors with highly general policy representations, without using known models or example demonstrations. Our approach learns a set of trajectories for the desired motion skill by using iteratively refitted time-varying linear models, and then unifies these trajectories into a single control policy that can generalize to new situations. To enable this method to run on a real robot, we introduce several improvements that reduce the sample count and automate parameter selection. We show that our method can acquire fast, fluent behaviors after only minutes of interaction time, and can learn robust controllers for complex tasks, including putting together a toy airplane, stacking tight-fitting lego blocks, placing wooden rings onto tight-fitting pegs, inserting a shoe tree into a shoe, and screwing bottle caps onto bottles.
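To make the trajectory-learning phase described above concrete, the sketch below illustrates its inner loop on a toy problem: roll out time-varying linear-Gaussian controllers, refit time-varying linear dynamics models to the samples by least squares, and update the controllers with an LQR backward pass. This is a minimal illustration under stated assumptions (a hypothetical 1-DoF point mass with a quadratic reaching cost), not the paper's implementation; the names rollout, fit_linear_dynamics, and lqr_backward are invented for this example, and the final step of unifying the local controllers into a single generalizing policy via supervised learning is omitted.

# Minimal sketch of the trajectory-optimization loop used in guided policy
# search, under stated assumptions: a toy point mass whose dynamics are
# unknown to the learner, a quadratic cost, and least-squares refitting of
# time-varying linear models. Names are illustrative, not from the paper.
import numpy as np

T, dX, dU = 50, 2, 1           # horizon, state dim (pos, vel), action dim
target = np.array([1.0, 0.0])  # reach this position and stop

def true_dynamics(x, u):
    """Unknown to the learner; used only to draw samples."""
    dt = 0.1
    return np.array([x[0] + dt * x[1], x[1] + dt * u[0]])

def rollout(K, k, noise=0.1):
    """Run one episode with a time-varying linear-Gaussian controller."""
    xs, us = [np.zeros(dX)], []
    for t in range(T):
        u = K[t] @ xs[-1] + k[t] + noise * np.random.randn(dU)
        us.append(u)
        xs.append(true_dynamics(xs[-1], u))
    return np.array(xs), np.array(us)

def fit_linear_dynamics(samples):
    """Refit x_{t+1} ~ F_t [x_t; u_t] + f_t by least squares at each step."""
    F, f = [], []
    for t in range(T):
        XU = np.array([np.concatenate([xs[t], us[t]]) for xs, us in samples])
        Xn = np.array([xs[t + 1] for xs, us in samples])
        sol, *_ = np.linalg.lstsq(
            np.hstack([XU, np.ones((len(XU), 1))]), Xn, rcond=None)
        F.append(sol[:-1].T)
        f.append(sol[-1])
    return F, f

def lqr_backward(F, f):
    """LQR pass on the fitted models with a quadratic reaching cost."""
    Q, R = np.diag([10.0, 1.0]), 1e-2 * np.eye(dU)
    V, v = Q, -Q @ target            # value function: x'Vx + 2v'x + const
    K, k = [None] * T, [None] * T
    for t in reversed(range(T)):
        Fx, Fu = F[t][:, :dX], F[t][:, dX:]
        Qxx, Quu = Q + Fx.T @ V @ Fx, R + Fu.T @ V @ Fu
        Qux = Fu.T @ V @ Fx
        qx = -Q @ target + Fx.T @ (v + V @ f[t])
        qu = Fu.T @ (v + V @ f[t])
        K[t] = -np.linalg.solve(Quu, Qux)
        k[t] = -np.linalg.solve(Quu, qu)
        V, v = Qxx + Qux.T @ K[t], qx + Qux.T @ k[t]
    return K, k

# Alternate between sampling, model refitting, and controller updates.
K, k = [np.zeros((dU, dX))] * T, [np.zeros(dU)] * T
for it in range(10):
    samples = [rollout(K, k) for _ in range(5)]
    F, f = fit_linear_dynamics(samples)
    K, k = lqr_backward(F, f)
xs, _ = rollout(K, k, noise=0.0)
print("final state:", xs[-1])  # should approach the target

The sketch deliberately omits the machinery that makes the full method practical on a real robot, such as constraining how far the trajectory distribution can change between iterations and using priors to reduce the number of samples needed for each dynamics refit; see [31] for the complete algorithm.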

[1] Roderic A. Grupen, et al. Learning reactive admittance control, 1992, Proceedings 1992 IEEE International Conference on Robotics and Automation.

[2] Jun Nakanishi, et al. Learning Attractor Landscapes for Learning Motor Primitives, 2002, NIPS.

[3] Jeff G. Schneider, et al. Covariant policy search, 2003, IJCAI.

[4] Peter Stone, et al. Policy gradient reinforcement learning for fast quadrupedal locomotion, 2004, IEEE International Conference on Robotics and Automation (ICRA).

[5] H. Sebastian Seung, et al. Stochastic policy gradient reinforcement learning on a simple 3D biped, 2004, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[6] Florentin Wörgötter, et al. Fast biped walking with a reflexive controller and real-time policy searching, 2005, NIPS.

[7] A. G. Alleyne, et al. A survey of iterative learning control, 2006, IEEE Control Systems.

[8] Stephen P. Boyd, et al. Convex Optimization, 2004, Cambridge University Press.

[9] Aude Billard, et al. Reinforcement learning for imitating constrained reaching movements, 2007, Adv. Robotics.

[10] Martin A. Riedmiller, et al. Neural Reinforcement Learning Controllers for a Real Robot Application, 2007, Proceedings 2007 IEEE International Conference on Robotics and Automation.

[11] Jun Morimoto, et al. Learning CPG-based Biped Locomotion with a Policy Gradient Method: Application to a Humanoid Robot, 2005, 5th IEEE-RAS International Conference on Humanoid Robots.

[12] Nikola Kasabov, et al. 2008 Special Issue, 2008.

[13] Stefan Schaal, et al. Reinforcement learning of motor skills with policy gradients, 2008, Neural Networks.

[14] Stefan Schaal, et al. Learning and generalization of motor skills by learning from demonstration, 2009, IEEE International Conference on Robotics and Automation.

[15] Stefan Schaal, et al. Reinforcement learning of motor skills in high dimensions: A path integral approach, 2010, IEEE International Conference on Robotics and Automation.

[16] Aude Billard, et al. BM: An iterative algorithm to learn stable non-linear dynamical systems with Gaussian mixture models, 2010, IEEE International Conference on Robotics and Automation.

[17] Yasemin Altun, et al. Relative Entropy Policy Search, 2010, AAAI.

[18] Carl E. Rasmussen, et al. Learning to Control a Low-Cost Manipulator using Data-Efficient Reinforcement Learning, 2011, Robotics: Science and Systems.

[19] Geoffrey J. Gordon, et al. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning, 2010, AISTATS.

[20] Stefan Schaal, et al. Skill learning and task outcome prediction for manipulation, 2011, IEEE International Conference on Robotics and Automation.

[21] Jan Peters, et al. Reinforcement Learning to Adjust Robot Movements to New Situations, 2010, IJCAI.

[22] Stefan Schaal, et al. Learning force control policies for compliant manipulation, 2011, IEEE/RSJ International Conference on Intelligent Robots and Systems.

[23] Sergey Levine, et al. Exploring Deep and Recurrent Architectures for Optimal Control, 2013, arXiv.

[24] Sergey Levine, et al. Guided Policy Search, 2013, ICML.

[25] Jan Peters, et al. Reinforcement learning in robotics: A survey, 2013, Int. J. Robotics Res.

[26] Jan Peters, et al. A Survey on Policy Search for Robotics, 2013, Found. Trends Robotics.

[27] Sergey Levine, et al. Variational Policy Search via Trajectory Optimization, 2013, NIPS.

[28] Sergey Levine, et al. Learning Complex Neural Network Policies with Trajectory Optimization, 2014, ICML.

[29] Jan Peters, et al. Sample-based information-theoretic stochastic optimal control, 2014, IEEE International Conference on Robotics and Automation (ICRA).

[30] Emanuel Todorov, et al. Combining the benefits of function approximation and trajectory optimization, 2014, Robotics: Science and Systems.

[31] Sergey Levine, et al. Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics, 2014, NIPS.