Apprenticeship learning and reinforcement learning with application to robotic control

Many problems in robotics have unknown, stochastic, high-dimensional, and highly nonlinear dynamics, and pose significant challenges to both traditional control methods and reinforcement learning algorithms. Some of the key difficulties that arise in these problems are: (i) It is often difficult to write down, in closed form, a formal specification of the control task. For example, what is the objective function for "flying well"? (ii) It is often difficult to build a good dynamics model, because of both data collection and data modeling challenges (similar to the "exploration problem" in reinforcement learning). (iii) It is often computationally expensive to find closed-loop controllers for high-dimensional, stochastic domains. We describe learning algorithms with formal performance guarantees that show these problems can be addressed efficiently in the apprenticeship learning setting, i.e., the setting in which expert demonstrations of the task are available. Our algorithms are guaranteed to return a control policy whose performance is comparable to the expert's, evaluated on the same task and in the same (typically stochastic, high-dimensional, and nonlinear) environment as the expert. Besides having theoretical guarantees, our algorithms have also enabled us to solve some previously unsolved real-world control problems: they have enabled a quadruped robot to traverse challenging, previously unseen terrain, and they have significantly extended the state of the art in autonomous helicopter flight. Our helicopter has performed by far the most challenging aerobatic maneuvers flown by any autonomous helicopter to date, including continuous in-place flips, rolls, and tic-tocs, which only exceptional expert human pilots can fly. Our aerobatic flight performance is comparable to that of the best human pilots.
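
To illustrate the kind of guarantee described above, the sketch below shows a feature-expectation matching loop in the style of apprenticeship learning via inverse reinforcement learning (the projection variant). It is a minimal sketch under stated assumptions, not the thesis implementation: the reward is assumed to be linear in a known feature map, and `solve_mdp` and `estimate_feature_expectations` are hypothetical helpers standing in for an RL solver and for Monte Carlo policy evaluation.

```python
import numpy as np

def apprenticeship_learning(mu_expert, solve_mdp, estimate_feature_expectations,
                            n_iters=50, eps=1e-3):
    """Projection-style apprenticeship learning (sketch).

    mu_expert: expert feature expectations, estimated from demonstrations.
    solve_mdp(w): hypothetical helper returning an (approximately) optimal
        policy for the reward R(s) = w . phi(s).
    estimate_feature_expectations(policy): hypothetical helper returning the
        discounted feature expectations of `policy` (e.g. by Monte Carlo rollouts).
    """
    # Start from an arbitrary policy and its feature expectations.
    policy = solve_mdp(np.zeros_like(mu_expert))
    mu = estimate_feature_expectations(policy)
    mu_bar = mu
    policies = [policy]

    for _ in range(n_iters):
        # Reward weights point from the current projection toward the expert.
        w = mu_expert - mu_bar
        if np.linalg.norm(w) <= eps:
            break  # some mixture of the found policies is near-expert
        # Best response: optimal policy under the current reward guess.
        policy = solve_mdp(w)
        mu = estimate_feature_expectations(policy)
        policies.append(policy)
        # Project mu_expert onto the line through mu_bar and mu.
        d = mu - mu_bar
        mu_bar = mu_bar + (d @ (mu_expert - mu_bar)) / (d @ d) * d

    return policies
```

On termination, some convex combination of the returned policies has feature expectations within eps of the expert's, and hence, for any reward that is linear in the features, expected return close to the expert's; this is the sense in which the learned controller's performance is comparable to the demonstrator's.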
