Preparing for the Unknown: Learning a Universal Policy with Online System Identification

We present a new method for learning control policies that operate successfully under unknown dynamic models. We create such policies by leveraging a large number of training examples generated with a physical simulator. Our system consists of two components: a Universal Policy (UP) and a function for Online System Identification (OSI). We describe our control policy as universal because it is trained over a wide array of dynamic models. These variations in the dynamic model may include differences in the mass and inertia of the robot's components, variable friction coefficients, or the unknown mass of an object to be manipulated. By training the Universal Policy over this variation, the control policy is prepared for a wider range of conditions when executed in an unknown environment. The second component of our system uses the recent state and action history to predict the dynamic model parameters μ. The value of μ estimated by the Online System Identification is then provided as input to the control policy, along with the system state. Together, UP-OSI forms a robust control policy that operates across a wide range of dynamic models and responds to sudden changes in the environment. We have evaluated the performance of this system on a variety of tasks, including cart-pole swing-up, the double inverted pendulum, locomotion of a hopper, and block-throwing with a manipulator. UP-OSI is effective at these tasks across a wide range of dynamic models. Moreover, when tested with dynamic models outside of the training range, UP-OSI outperforms the Universal Policy alone, even when UP is given the true dynamic model parameters. In addition to producing more robust controllers, UP-OSI also holds promise for narrowing the Reality Gap between simulated and real physical systems.

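The architecture described above is straightforward to express in code. The following is a minimal PyTorch sketch of one UP-OSI control step, assuming feed-forward networks for both components; the layer sizes, history length, and state/action/μ dimensions are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (roughly cart-pole sized); assumptions, not the paper's setup.
STATE_DIM, ACT_DIM, MU_DIM, HIST_LEN = 4, 1, 2, 10

class UniversalPolicy(nn.Module):
    """UP: pi(a | s, mu) -- conditioned on the state AND the dynamic parameters mu."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + MU_DIM, 64), nn.Tanh(),
            nn.Linear(64, ACT_DIM), nn.Tanh())

    def forward(self, state, mu):
        return self.net(torch.cat([state, mu], dim=-1))

class OSI(nn.Module):
    """OSI: regresses mu from the recent history of (state, action) pairs."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(HIST_LEN * (STATE_DIM + ACT_DIM), 64), nn.Tanh(),
            nn.Linear(64, MU_DIM))

    def forward(self, history):
        # history: (..., HIST_LEN, STATE_DIM + ACT_DIM), flattened for the MLP
        return self.net(history.flatten(start_dim=-2))

def uposi_step(up, osi, state, history):
    """One control step: identify mu online, then act with the universal policy."""
    mu_hat = osi(history)      # online system identification from recent history
    return up(state, mu_hat)   # universal policy conditioned on (state, mu_hat)

# Usage with untrained networks and dummy data:
up, osi = UniversalPolicy(), OSI()
state = torch.zeros(STATE_DIM)
history = torch.zeros(HIST_LEN, STATE_DIM + ACT_DIM)
action = uposi_step(up, osi, state, history)
```

Note that the two components can be trained separately: UP can be trained in simulation with the true μ supplied as an input, after which OSI can be trained by supervised regression to recover μ from state-action histories collected while executing the policy.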