Towards Generalization and Simplicity in Continuous Control

This work shows that policies with simple linear and RBF parameterizations can be trained to solve a variety of continuous control tasks, including the OpenAI gym benchmarks. The performance of these trained policies is competitive with state-of-the-art results obtained with more elaborate parameterizations such as fully connected neural networks. Furthermore, existing training and testing scenarios are shown to be very limited and prone to over-fitting, giving rise only to trajectory-centric policies. Training with a diverse initial state distribution is shown to produce more global policies with better generalization. This allows for interactive control scenarios in which the system recovers from large on-line perturbations, as shown in the supplementary video.
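To make the "simple parameterizations" concrete, below is a minimal sketch of a Gaussian policy whose mean is linear in random Fourier features of the observation, in the spirit of the RBF construction of Rahimi and Recht (random features for kernel machines). This is an illustrative reconstruction, not the authors' exact implementation: the class name, bandwidth `nu`, feature count, and the frozen-projection design are assumptions made for the example. A purely linear policy is the special case where the feature map returns the observation unchanged.

```python
import numpy as np

class RFFLinearPolicy:
    """Gaussian policy over random Fourier features of the state (sketch).

    Assumptions: sinusoidal random features approximating an RBF kernel,
    a linear map from features to mean actions, and a state-independent
    diagonal log standard deviation. Hyperparameter values are illustrative.
    """

    def __init__(self, obs_dim, act_dim, num_features=500, nu=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # Random projection and phases are sampled once and then frozen.
        self.P = rng.standard_normal((num_features, obs_dim))
        self.phi = rng.uniform(-np.pi, np.pi, size=num_features)
        self.nu = nu
        # Learnable parameters: linear map on features, bias, and log std.
        self.W = np.zeros((act_dim, num_features))
        self.b = np.zeros(act_dim)
        self.log_std = np.zeros(act_dim)

    def features(self, obs):
        # Random Fourier features: sin of a random projection plus phase.
        return np.sin(self.P @ obs / self.nu + self.phi)

    def act(self, obs, rng=None):
        # Sample an action from a diagonal Gaussian centered at the
        # feature-linear mean; exploration noise scales with exp(log_std).
        rng = rng or np.random.default_rng()
        mean = self.W @ self.features(obs) + self.b
        return mean + np.exp(self.log_std) * rng.standard_normal(mean.shape)
```

Such a policy has only `num_features * act_dim + 2 * act_dim` trainable parameters and can be optimized with standard policy-gradient methods (e.g., natural policy gradient), which is what makes the comparison against fully connected networks informative.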
