2008 Special Issue: Reinforcement learning of motor skills with policy gradients

Autonomous learning is one of the hallmarks of human and animal behavior, and understanding the principles of learning will be crucial in order to achieve true autonomy in advanced machines like humanoid robots. In this paper, we examine learning of complex motor skills with human-like limbs. While supervised learning can offer useful tools for bootstrapping behavior, e.g., by learning from demonstration, it is only reinforcement learning that offers a general approach to the final trial-and-error improvement that is needed by each individual acquiring a skill. Neither neurobiological nor machine learning studies have, so far, offered compelling results on how reinforcement learning can be scaled to the high-dimensional continuous state and action spaces of humans or humanoids. Here, we combine two recent research developments on learning motor control in order to achieve this scaling. First, we interpret the idea of modular motor control by means of motor primitives as a suitable way to generate parameterized control policies for reinforcement learning. Second, we combine motor primitives with the theory of stochastic policy gradient learning, which currently seems to be the only feasible framework for reinforcement learning for humanoids. We evaluate different policy gradient methods with a focus on their applicability to parameterized motor primitives. We compare these algorithms in the context of motor primitive learning, and show that our most modern algorithm, the Episodic Natural Actor-Critic outperforms previous algorithms by at least an order of magnitude. We demonstrate the efficiency of this reinforcement learning method in the application of learning to hit a baseball with an anthropomorphic robot arm.

[1]  Yishay Mansour,et al.  Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.

[2]  Jun Morimoto,et al.  Learning CPG Sensory Feedback with Policy Gradient for Biped Locomotion for a Full-Body Humanoid , 2005, AAAI.

[3]  T. Moon,et al.  Mathematical Methods and Algorithms for Signal Processing , 1999 .

[4]  Douglas Aberdeen,et al.  POMDPs and Policy Gradients , 2006 .

[5]  Lennart Råde,et al.  Springers Mathematische Formeln , 1996 .

[6]  Shin Ishii,et al.  Fast and Stable Learning of Quasi-Passive Dynamic Walking by an Unstable Biped Robot based on Off-Policy Natural Actor-Critic , 2006, 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[7]  James C. Spall,et al.  Introduction to stochastic search and optimization - estimation, simulation, and control , 2003, Wiley-Interscience series in discrete mathematics and optimization.

[8]  Oliver G. Selfridge,et al.  Real-time learning: a ball on a beam , 1992, [Proceedings 1992] IJCNN International Joint Conference on Neural Networks.

[9]  Shin Ishii,et al.  Reinforcement Learning for CPG-Driven Biped Robot , 2004, AAAI.

[10]  Stefan Schaal,et al.  Rapid synchronization and accurate phase-locking of rhythmic motor primitives , 2005, 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[11]  A. Berny,et al.  Statistical machine learning and combinatorial optimization , 2001 .

[12]  James C. Spall,et al.  Introduction to Stochastic Search and Optimization. Estimation, Simulation, and Control (Spall, J.C. , 2007 .

[13]  Thomas Hofmann,et al.  Natural Actor-Critic for Road Traffic Optimisation , 2007 .

[14]  Lex Weaver,et al.  The Optimal Reward Baseline for Gradient-Based Reinforcement Learning , 2001, UAI.

[15]  Jin Yu,et al.  Natural Actor-Critic for Road Traffic Optimisation , 2006, NIPS.

[16]  Vijaykumar Gullapalli,et al.  Learning Control Under Extreme Uncertainty , 1992, NIPS.

[17]  Noah J. Cowan,et al.  Efficient Gradient Estimation for Motor Control Learning , 2002, UAI.

[18]  Richard S. Sutton,et al.  Neuronlike adaptive elements that can solve difficult learning control problems , 1983, IEEE Transactions on Systems, Man, and Cybernetics.

[19]  Machine Learning of Motor Skills for Robotics, Jan Peters , 2022 .

[20]  Sham M. Kakade,et al.  Optimizing Average Reward Using Discounted Rewards , 2001, COLT/EuroCOLT.

[21]  Jongho Kim,et al.  An RLS-Based Natural Actor-Critic Algorithm for Locomotion of a Two-Linked Robot Arm , 2005, CIS.

[22]  Shin Ishii,et al.  Natural Policy Gradient Reinforcement Learning for a CPG Control of a Biped Robot , 2004, PPSN.

[23]  Stefan Schaal,et al.  A Kendama learning robot based on a dynamic optimization theory , 1995, Proceedings 4th IEEE International Workshop on Robot and Human Communication.

[24]  J. Spall,et al.  Simulation-Based Optimization with Stochastic Approximation Using Common Random Numbers , 1999 .

[25]  Peter L. Bartlett,et al.  Experiments with Infinite-Horizon, Policy-Gradient Estimation , 2001, J. Artif. Intell. Res..

[26]  Stefan Schaal,et al.  Natural Actor-Critic , 2003, Neurocomputing.

[27]  Bruno Siciliano,et al.  Modeling and Control of Robot Manipulators , 1995 .

[28]  Jun Nakanishi,et al.  Learning Attractor Landscapes for Learning Motor Primitives , 2002, NIPS.

[29]  H. Sebastian Seung,et al.  Learning to Walk in 20 Minutes , 2005 .

[30]  Stefan Schaal,et al.  Reinforcement Learning for Humanoid Robotics , 2003 .

[31]  Shigenobu Kobayashi,et al.  Reinforcement learning for continuous action using stochastic gradient ascent , 1998 .

[32]  Shun-ichi Amari,et al.  Natural Gradient Works Efficiently in Learning , 1998, Neural Computation.

[33]  Tamar Flash,et al.  Motor primitives in vertebrates and invertebrates , 2005, Current Opinion in Neurobiology.

[34]  Peter L. Bartlett,et al.  Infinite-Horizon Policy-Gradient Estimation , 2001, J. Artif. Intell. Res..

[35]  S. Ishii,et al.  Off-Policy Natural Actor-Critic , 2005 .

[36]  Andrew Zisserman,et al.  Advances in Neural Information Processing Systems (NIPS) , 2007 .

[37]  M. Kawato,et al.  Trajectory formation of arm movement by a neural network with forward and inverse dynamics models , 1993 .

[38]  Peter L. Bartlett,et al.  Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning , 2001, J. Mach. Learn. Res..

[39]  Jun Nakanishi,et al.  Movement imitation with nonlinear dynamical systems in humanoid robots , 2002, Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No.02CH37292).

[40]  Christopher G. Atkeson,et al.  Using Local Trajectory Optimizers to Speed Up Global Optimization in Dynamic Programming , 1993, NIPS.

[41]  J. Spall,et al.  Optimal random perturbations for stochastic approximation using a simultaneous perturbation gradient approximation , 1997, Proceedings of the 1997 American Control Conference (Cat. No.97CH36041).

[42]  Stefan Schaal,et al.  Policy Gradient Methods for Robotics , 2006, 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[43]  W Wim Meiden,et al.  Book review: Springers mathematische Formeln. Taschenbuch für Ingenieure, Naturwissenschaftler, Wirtschaftswissenschaftler , 1998 .

[44]  Ronald J. Williams,et al.  Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 2004, Machine Learning.

[45]  D. Harville Matrix Algebra From a Statistician's Perspective , 1998 .

[46]  Jun Morimoto,et al.  Learning from demonstration and adaptation of biped locomotion , 2004, Robotics Auton. Syst..

[47]  M. Ciletti,et al.  The computation and theory of optimal control , 1972 .

[48]  J. Spall STOCHASTIC OPTIMIZATION , 2002 .

[49]  D. Signorini,et al.  Neural networks , 1995, The Lancet.

[50]  Jun Morimoto,et al.  Minimax Differential Dynamic Programming: An Application to Robust Biped Walking , 2002, NIPS.

[51]  Vijay Balasubramanian,et al.  Statistical Inference, Occam's Razor, and Statistical Mechanics on the Space of Probability Distributions , 1996, Neural Computation.

[52]  John N. Tsitsiklis,et al.  Actor-Critic Algorithms , 1999, NIPS.

[53]  Michael I. Jordan,et al.  PEGASUS: A policy search method for large MDPs and POMDPs , 2000, UAI.

[54]  Sham M. Kakade,et al.  On the sample complexity of reinforcement learning. , 2003 .

[55]  L. Hasdorff Gradient Optimization and Nonlinear Control , 1976 .

[56]  Sham M. Kakade,et al.  A Natural Policy Gradient , 2001, NIPS.

[57]  David Q. Mayne,et al.  Differential dynamic programming , 1972, The Mathematical Gazette.

[58]  Shin Ishii,et al.  Reinforcement Learning for Biped Locomotion , 2002, ICANN.

[59]  Alison L Gibbs,et al.  On Choosing and Bounding Probability Metrics , 2002, math/0209021.

[60]  R. Fletcher Practical Methods of Optimization , 1988 .

[61]  James C. Spall,et al.  Introduction to stochastic search and optimization - estimation, simulation, and control , 2003, Wiley-Interscience series in discrete mathematics and optimization.

[62]  Vijaykumar Gullapalli,et al.  A stochastic reinforcement learning algorithm for learning real-valued functions , 1990, Neural Networks.

[63]  P. Glynn LIKELIHOOD RATIO GRADIENT ESTIMATION : AN OVERVIEW by , 2022 .

[64]  Jeff G. Schneider,et al.  Covariant Policy Search , 2003, IJCAI.

[65]  Jun Nakanishi,et al.  Learning Movement Primitives , 2005, ISRR.

[66]  Takayuki Kanda,et al.  Robot behavior adaptation for human-robot interaction based on policy gradient reinforcement learning , 2005, 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[67]  Aude Billard,et al.  Reinforcement learning for imitating constrained reaching movements , 2007, Adv. Robotics.

[68]  Michael C. Fu,et al.  Feature Article: Optimization for simulation: Theory vs. Practice , 2002, INFORMS J. Comput..

[69]  V. Gullapalli,et al.  Acquiring robot skills via reinforcement learning , 1994, IEEE Control Systems.

[70]  Peter W. Glynn,et al.  Likelilood ratio gradient estimation: an overview , 1987, WSC '87.

[71]  Peter W. Glynn,et al.  Likelihood ratio gradient estimation for stochastic systems , 1990, CACM.

[72]  Peter Stone,et al.  Policy gradient reinforcement learning for fast quadrupedal locomotion , 2004, IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA '04. 2004.

[73]  Amir Karniel,et al.  Minimum Acceleration Criterion with Constraints Implies Bang-Bang Control as an Underlying Principle for Optimal Trajectories of Arm Reaching Movements , 2008, Neural Computation.

[74]  Judy A. Franklin,et al.  Biped dynamic walking using reinforcement learning , 1997, Robotics Auton. Syst..