Model-based reinforcement learning with parametrized physical models and optimism-driven exploration

In this paper, we present a model-based reinforcement learning method for robotics that combines ideas from model identification and model predictive control. We use a feature-based representation of the dynamics that allows the model to be fitted with a simple least-squares procedure; the features are derived from a high-level specification of the robot's morphology, consisting of the number and connectivity structure of its links. Model predictive control is then used to choose actions under an optimistic model of the dynamics, which produces an efficient and goal-directed exploration strategy. We present real-time experimental results on standard benchmark problems involving the pendulum, cartpole, and double pendulum systems, where our method learns a range of tasks substantially faster than the previous best methods. To evaluate our approach on a realistic robotic control task, we also demonstrate real-time control of a simulated 7-degree-of-freedom arm.

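To make the identification step concrete, here is a minimal sketch of the least-squares fit, assuming the dynamics are linear in the unknown parameters (a feature matrix Y computed from each observed state satisfies Y @ theta = tau, as in standard inertial-parameter identification) and that the feature matrices have already been derived from the link structure. The function name and the small regularization term are illustrative assumptions, not details taken from the paper.

    import numpy as np

    def fit_dynamics_parameters(feature_matrices, torques, reg=1e-6):
        # Stack one regression equation per observed transition:
        # Y_t @ theta = tau_t, where Y_t is the feature matrix
        # computed from the state and the robot's link structure.
        Y = np.vstack(feature_matrices)    # (T * dof, n_params)
        tau = np.concatenate(torques)      # (T * dof,)
        # Regularized normal equations: (Y'Y + reg*I) theta = Y'tau.
        A = Y.T @ Y + reg * np.eye(Y.shape[1])
        return np.linalg.solve(A, Y.T @ tau)

Because the model is linear in the parameters, the fit amounts to solving a small linear system, which keeps it cheap enough for online use.
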
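The optimistic exploration step can be sketched in the same spirit. A simple random-shooting form of model predictive control scores each candidate action sequence by its best-case (most optimistic) cost over a set of plausible models near the current parameter estimate; sampling the models from a Gaussian is an assumption standing in for a proper confidence set, and rollout_cost and sample_actions are hypothetical helpers.

    import numpy as np

    def optimistic_mpc_action(state, theta_hat, theta_cov, rollout_cost,
                              sample_actions, n_models=10, n_candidates=64):
        rng = np.random.default_rng()
        # Plausible models near the least-squares estimate (assumed
        # Gaussian here in place of a rigorous confidence region).
        models = rng.multivariate_normal(theta_hat, theta_cov, size=n_models)
        best_first_action, best_score = None, np.inf
        for _ in range(n_candidates):
            actions = sample_actions()  # hypothetical: one action sequence
            # Optimism: judge the sequence by the *best* model for it.
            score = min(rollout_cost(state, actions, th) for th in models)
            if score < best_score:
                best_score, best_first_action = score, actions[0]
        return best_first_action  # MPC: execute first action, then replan

Favoring actions whose best-case outcome is good drives the system toward states where an optimistic model can still be confirmed or refuted, which is the goal-directed exploration effect described above.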