Trajectory-Based Optimal Control Techniques and the Basics of Learning Control

In a not too distant future, robots will be a natural part of daily life in human society, providing assistance in many areas ranging from clinical applications, education, and caregiving to normal household environments [1]. It is hard to imagine that all possible tasks can be preprogrammed in such robots. Robots need to be able to learn, either by themselves or with the help of human supervision. Additionally, wear and tear on robots in daily use needs to be automatically compensated for, which requires a form of continuous self-calibration, another form of learning. Finally, robots need to react to stochastic and dynamic environments, i.e., they need to learn how to optimally adapt to uncertainty and unforeseen changes. Robot learning is going to be a key ingredient for the future of autonomous robots.

While robot learning covers a rather large field, from learning to perceive, to plan, to make decisions, etc., we will focus this review on topics of learning control, in particular as it is concerned with learning control in simulated or actual physical robots. In general, learning control refers to the process of acquiring a control strategy for a particular control system and a particular task by trial and error. Learning control is usually distinguished from adaptive control [2] in that the learning system can have rather general optimization objectives (not just, e.g., minimal tracking error) and is permitted to fail during the process of learning, while adaptive control emphasizes fast convergence without failure. Thus, learning control resembles the way that humans and animals acquire new movement strategies, while adaptive control is a special case of learning control that fulfills stringent performance constraints, e.g., as needed in life-critical systems like airplanes.

Learning control has been an active topic of research for at least three decades. However, given the lack of working robots that actually use learning components, more work needs to be done before robot learning will make it beyond the laboratory environment. This article will survey some ongoing and past activities in robot learning to assess where the field stands and where it is going. We will largely focus on nonwheeled robots and less on topics of state estimation, as typically explored in wheeled robots [3]–[6], and we emphasize learning in continuous state-action spaces rather than discrete state-action spaces [7], [8]. We will illustrate the different topics of robot learning with examples from our own research with anthropomorphic …
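To make the notion of trial-and-error learning control concrete, the following is a minimal sketch of an episodic learning loop: a parameterized control policy is perturbed, each perturbation is evaluated on a rollout under a general cost function, and the perturbation is kept only if it lowers the accumulated cost. This is an illustrative example rather than an algorithm from the article; the plant interface (reset, step), the cost function, and the names rollout_cost and learn_by_trial_and_error are assumptions introduced here for the sketch.

```python
import numpy as np

# Minimal sketch of learning control as trial-and-error policy improvement.
# Illustrative only: the plant interface (plant.reset(), plant.step(x, u))
# and the cost function are assumed placeholders supplied by the user.

def rollout_cost(theta, plant, cost, horizon=200):
    """Run one episode with linear state feedback u = theta @ x and sum the cost."""
    x = plant.reset()
    total = 0.0
    for _ in range(horizon):
        u = theta @ x              # simple parameterized policy
        x = plant.step(x, u)       # plant may be stochastic; failed trials are allowed
        total += cost(x, u)
    return total

def learn_by_trial_and_error(theta, plant, cost, episodes=100, sigma=0.05):
    """Keep a random perturbation of the policy parameters only if it lowers the cost."""
    best = rollout_cost(theta, plant, cost)
    for _ in range(episodes):
        candidate = theta + sigma * np.random.randn(*theta.shape)
        c = rollout_cost(candidate, plant, cost)
        if c < best:
            theta, best = candidate, c
    return theta
```

In contrast to adaptive control, nothing in this loop has to guarantee that an individual trial succeeds, and the cost being minimized can encode arbitrary task objectives rather than only tracking error.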

[1]  Nicholas Roy, Geoffrey J. Gordon, and Sebastian Thrun.  Finding Approximate POMDP Solutions Through Belief Compression, 2005, J. Artif. Intell. Res.

[2]  Christopher G. Atkeson,et al.  Control of a walking biped using a combination of simple policies , 2009, 2009 9th IEEE-RAS International Conference on Humanoid Robots.

[3]  Sanjiv Singh,et al.  The DARPA Urban Challenge: Autonomous Vehicles in City Traffic, George Air Force Base, Victorville, California, USA , 2009, The DARPA Urban Challenge.

[4]  Christopher G. Atkeson,et al.  Standing balance control using a trajectory library , 2009, 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[5]  Emanuel Todorov,et al.  Efficient computation of optimal actions , 2009, Proceedings of the National Academy of Sciences.

[6]  David Silver,et al.  Learning to search: Functional gradient techniques for imitation learning , 2009, Auton. Robots.

[7]  Jan Peters,et al.  Learning motor primitives for robotics , 2009, 2009 IEEE International Conference on Robotics and Automation.

[8]  Carl E. Rasmussen,et al.  Gaussian process dynamic programming , 2009, Neurocomputing.

[9]  Duy Nguyen-Tuong,et al.  Local Gaussian Process Regression for Real Time Online Model Learning , 2008, NIPS.

[10]  Jan Peters,et al.  Fitted Q-iteration by Advantage Weighted Regression , 2008, NIPS.

[11]  Jürgen Schmidhuber,et al.  State-Dependent Exploration for Policy Gradient Methods , 2008, ECML/PKDD.

[12]  Stefan Schaal,et al.  A Bayesian approach to empirical local linearization for robotics , 2008, 2008 IEEE International Conference on Robotics and Automation.

[13]  Stefan Schaal,et al.  2008 Special Issue: Reinforcement learning of motor skills with policy gradients , 2008 .

[14]  Stefan Schaal,et al.  Learning to Control in Operational Space , 2008, Int. J. Robotics Res..

[15]  Christopher G. Atkeson,et al.  Random Sampling of States in Dynamic Programming , 2007, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[16]  Sanjiv Singh,et al.  The 2005 DARPA Grand Challenge: The Great Robot Race , 2007 .

[17]  Christopher M. Bishop.  Pattern Recognition and Machine Learning, 2006, Springer.

[18]  Stefan Schaal,et al.  The New Robotics—towards Human-centered Machines , 2007 .

[19]  C. Atkeson Randomly Sampling Actions In Dynamic Programming , 2007, 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning.

[20]  H. Kappen An introduction to stochastic control theory, path integrals and reinforcement learning , 2007 .

[21]  Stefan Schaal,et al.  Policy Gradient Methods for Robotics , 2006, 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[22]  Marc Toussaint,et al.  Probabilistic inference for solving discrete and continuous state Markov Decision Processes , 2006, ICML.

[23]  Jun Morimoto,et al.  Learning CPG-based Biped Locomotion with a Policy Gradient Method: Application to a Humanoid Robot , 2005, 5th IEEE-RAS International Conference on Humanoid Robots, 2005..

[24]  Stefan Schaal,et al.  Incremental Online Learning in High Dimensions , 2005, Neural Computation.

[25]  Pierre Geurts,et al.  Tree-Based Batch Mode Reinforcement Learning , 2005, J. Mach. Learn. Res..

[26]  H. Kappen Linear theory for control of nonlinear stochastic systems. , 2004, Physical review letters.

[27]  H. Sebastian Seung,et al.  Stochastic policy gradient reinforcement learning on a simple 3D biped , 2004, 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No.04CH37566).

[28]  Jessica K. Hodgins,et al.  Synthesizing physically realistic human motion in low-dimensional, behavior-specific spaces , 2004, ACM Trans. Graph..

[29]  Pieter Abbeel,et al.  Apprenticeship learning via inverse reinforcement learning , 2004, ICML.

[30]  Yoshihiko Nakamura,et al.  Embodied Symbol Emergence Based on Mimesis Theory , 2004, Int. J. Robotics Res..

[31]  Stefan Schaal,et al.  Natural Actor-Critic , 2003, Neurocomputing.

[32]  Stefan Schaal, et al., 2007.

[33]  Carl E. Rasmussen,et al.  Gaussian processes for machine learning , 2005, Adaptive computation and machine learning.

[34]  Andrew W. Moore,et al.  Variable Resolution Discretization in Optimal Control , 2002, Machine Learning.

[35]  Stefan Schaal,et al.  Learning inverse kinematics , 2001, Proceedings 2001 IEEE/RSJ International Conference on Intelligent Robots and Systems. Expanding the Societal Role of Robotics in the the Next Millennium (Cat. No.01CH37180).

[36]  Sham M. Kakade,et al.  A Natural Policy Gradient , 2001, NIPS.

[37]  Lorenzo Sciavicco and Bruno Siciliano.  Modelling and Control of Robot Manipulators, 2000, Springer.

[38]  Michael I. Jordan,et al.  PEGASUS: A policy search method for large MDPs and POMDPs , 2000, UAI.

[39]  Jun Morimoto,et al.  Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning , 2000, Robotics Auton. Syst..

[40]  Yishay Mansour,et al.  Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.

[41]  Shun-ichi Amari,et al.  Natural Gradient Learning for Over- and Under-Complete Bases in ICA , 1999, Neural Computation.

[42]  Shun-ichi Amari, et al.  Natural Gradient Learning for Over- and Under-Complete Bases in ICA, 1999.

[43]  Stefan Schaal,et al.  Is imitation learning the route to humanoid robots? , 1999, Trends in Cognitive Sciences.

[44]  Christopher G. Atkeson,et al.  Constructive Incremental Learning from Only Local Information , 1998, Neural Computation.

[45]  D M Wolpert,et al.  Multiple paired forward and inverse models for motor control , 1998, Neural Networks.

[46]  Stefan Schaal,et al.  Robot Learning From Demonstration , 1997, ICML.

[47]  J. Spall,et al.  Optimal random perturbations for stochastic approximation using a simultaneous perturbation gradient approximation , 1997, Proceedings of the 1997 American Control Conference (Cat. No.97CH36041).

[48]  Stefan Schaal,et al.  Learning tasks from a single demonstration , 1997, Proceedings of International Conference on Robotics and Automation.

[49]  Andrew W. Moore,et al.  Locally Weighted Learning for Control , 1997, Artificial Intelligence Review.

[50]  Andrew W. Moore,et al.  Locally Weighted Learning , 1997, Artificial Intelligence Review.

[51]  Stefan Schaal,et al.  Learning from Demonstration , 1996, NIPS.

[52]  Christopher J. C. H. Watkins.  Learning from Delayed Rewards, 1989, Ph.D. dissertation, University of Cambridge.

[53]  Michael I. Jordan,et al.  Supervised learning from incomplete data via an EM approach , 1993, NIPS.

[54]  S. Grossberg,et al.  A Self-Organizing Neural Model of Motor Equivalent Reaching and Tool Use by a Multijoint Arm , 1993, Journal of Cognitive Neuroscience.

[55]  Michael I. Jordan,et al.  Forward Models: Supervised Learning with a Distal Teacher , 1992, Cogn. Sci..

[56]  Ronald J. Williams.  Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, 1992, Machine Learning.

[57]  Richard S. Sutton,et al.  Learning to predict by the methods of temporal differences , 1988, Machine Learning.

[58]  W. Cleveland Robust Locally Weighted Regression and Smoothing Scatterplots , 1979 .

[59]  A. P. Dempster, N. M. Laird, and D. B. Rubin.  Maximum Likelihood from Incomplete Data via the EM Algorithm, 1977, J. Roy. Statist. Soc. Ser. B.

[60]  Yaakov Bar-Shalom,et al.  Caution, Probing, and the Value of Information in the Control of Uncertain Systems , 1976 .

[61]  Y. Bar-Shalom,et al.  Wide-sense adaptive dual control for nonlinear stochastic systems , 1973 .

[62]  M. Ciletti,et al.  The computation and theory of optimal control , 1972 .

[63]  D. H. Jacobson and D. Q. Mayne.  Differential Dynamic Programming, 1970, Elsevier.

[64]  Henk Nijmeijer,et al.  Robot Programming by Demonstration , 2010, SIMPAR.

[65]  Jennie Si, Andrew G. Barto, Warren B. Powell, and Donald C. Wunsch (Eds.).  Handbook of Learning and Approximate Dynamic Programming, 2004, Wiley-IEEE Press.

[66]  Richard S. Sutton and Andrew G. Barto.  Reinforcement Learning: An Introduction, 1998, MIT Press.

[67]  Jun Nakanishi,et al.  Learning Attractor Landscapes for Learning Motor Primitives , 2002, NIPS.

[68]  Jonathan Baxter,et al.  Scaling Internal-State Policy-Gradient Methods for POMDPs , 2002 .

[69]  Manfred Opper,et al.  Sparse Representation for Gaussian Process Models , 2000, NIPS.

[70]  Andrew Y. Ng and Stuart J. Russell.  Algorithms for Inverse Reinforcement Learning, 2000, ICML.

[71]  Kenji Doya,et al.  Reinforcement Learning in Continuous Time and Space , 2000, Neural Computation.

[72]  Geoffrey E. Hinton,et al.  Using EM for Reinforcement Learning , 2000 .

[73]  IEEE Robotics & Automation Magazine, 1994.

[74]  M. Kawato,et al.  Trajectory formation of arm movement by a neural network with forward and inverse dynamics models , 1993 .

[75]  Vijaykumar Gullapalli,et al.  A stochastic reinforcement learning algorithm for learning real-valued functions , 1990, Neural Networks.

[76]  Christopher G. Atkeson,et al.  Using Local Models to Control Movement , 1989, NIPS.