Policy Search for Motor Primitives in Robotics

Many motor skills in humanoid robotics can be learned using parametrized motor primitives. While successful applications to date have been achieved with imitation learning, most of the interesting motor learning problems are high-dimensional reinforcement learning problems. These problems are often beyond the reach of current reinforcement learning methods. In this paper, we study parametrized policy search methods and apply these to benchmark problems of motor primitive learning in robotics. We show that many well-known parametrized policy search methods can be derived from a general, common framework. This framework yields both policy gradient methods and expectation-maximization (EM) inspired algorithms. We introduce a novel EM-inspired algorithm for policy learning that is particularly well-suited for dynamical system motor primitives. We compare this algorithm, both in simulation and on a real robot, to several well-known parametrized policy search methods such as episodic REINFORCE, ‘Vanilla’ Policy Gradients with optimal baselines, episodic Natural Actor-Critic, and episodic Reward-Weighted Regression. We show that the proposed method outperforms them on an empirical benchmark of learning dynamical system motor primitives both in simulation and on a real robot. We apply it in the context of motor learning and show that it can learn a complex Ball-in-a-Cup task on a real Barrett WAM™ robot arm.
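As a rough illustration of the EM-inspired, reward-weighted update described in the abstract, the toy Python sketch below perturbs the motor-primitive parameters, scores each rollout with an episodic return, and takes a return-weighted average of the exploration noise as the new parameter estimate. The synthetic return function, the fixed exploration noise, and all names below are our own assumptions for this sketch; it is not the authors' implementation.

```python
import numpy as np

# Toy sketch of an EM-style, reward-weighted policy update (assumptions:
# a synthetic return function, fixed exploration noise, a 5-D parameter
# vector standing in for motor-primitive parameters).

rng = np.random.default_rng(0)
theta = np.zeros(5)            # motor-primitive parameters (toy dimension)
sigma = 0.5                    # fixed exploration noise (assumption)
target = np.array([0.6, -0.2, 0.1, 0.8, -0.5])   # unknown optimum (toy)

def rollout_return(params):
    """Toy episodic return: higher when params are close to `target`."""
    return np.exp(-np.sum((params - target) ** 2))

for iteration in range(200):
    # E-step analogue: sample perturbed rollouts and score each with its return.
    eps = sigma * rng.standard_normal((20, theta.size))    # exploration noise
    returns = np.array([rollout_return(theta + e) for e in eps])
    # M-step analogue: return-weighted average of the exploration noise,
    # i.e. a weighted maximum-likelihood step toward high-return rollouts.
    theta = theta + eps.T @ returns / (returns.sum() + 1e-10)

print("learned parameters:", np.round(theta, 2))
```

Because the weights are nonnegative returns, the step is a weighted maximum-likelihood average rather than a gradient step, so no hand-tuned learning rate is needed; this is the practical appeal of EM-style policy search over vanilla policy gradients.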
