From Pixels to Torques: Policy Learning with Deep Dynamical Models

Data-efficient learning in continuous state-action spaces from very high-dimensional observations remains a key challenge in developing fully autonomous systems. In this paper, we consider one instance of this challenge, the pixels-to-torques problem, where an agent must learn a closed-loop control policy from pixel information only. We introduce a data-efficient, model-based reinforcement learning algorithm that learns such a closed-loop policy directly from pixel information. The key ingredient is a deep dynamical model that uses deep autoencoders to learn a low-dimensional embedding of images jointly with a prediction model in this low-dimensional feature space. This joint learning ensures that the embedding captures not only the static properties of the data but also its dynamic properties. This is crucial for long-term predictions, which lie at the core of the adaptive model predictive control strategy that we use for closed-loop control. Compared to state-of-the-art reinforcement learning methods, our approach learns quickly, scales to high-dimensional state spaces, and facilitates fully autonomous learning from pixels to torques.
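
To make the joint-learning idea concrete, the sketch below trains an autoencoder embedding together with a latent-space prediction model under a single shared objective. This is a minimal illustration of the concept only: the layer sizes, module names, flattened-image inputs, and the unweighted sum of the reconstruction and prediction losses are assumptions for the example, not the architecture or training details used in the paper.

```python
# Minimal sketch of jointly learning an image embedding and a latent
# prediction model, in the spirit of the deep dynamical model described above.
# All dimensions and the loss weighting are illustrative assumptions.
import torch
import torch.nn as nn


class DeepDynamicalModelSketch(nn.Module):
    def __init__(self, pixel_dim: int, latent_dim: int, control_dim: int):
        super().__init__()
        # Encoder: maps a flattened image x_t to a low-dimensional feature z_t.
        self.encoder = nn.Sequential(
            nn.Linear(pixel_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: reconstructs the image from z_t (autoencoder part).
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, pixel_dim),
        )
        # Prediction model: predicts z_{t+1} from (z_t, u_t) in feature space.
        self.dynamics = nn.Sequential(
            nn.Linear(latent_dim + control_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, x_t, u_t, x_next):
        z_t = self.encoder(x_t)
        z_next = self.encoder(x_next)
        # Static term: reconstruct the current image from its embedding.
        recon_loss = nn.functional.mse_loss(self.decoder(z_t), x_t)
        # Dynamic term: predict the next embedding from the current one and the control.
        pred_loss = nn.functional.mse_loss(
            self.dynamics(torch.cat([z_t, u_t], dim=-1)), z_next
        )
        # Joint objective: both properties are learned together.
        return recon_loss + pred_loss


# Toy usage with random "images" and torques, just to show one training step.
model = DeepDynamicalModelSketch(pixel_dim=32 * 32, latent_dim=4, control_dim=1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x_t, u_t, x_next = torch.rand(64, 1024), torch.rand(64, 1), torch.rand(64, 1024)
loss = model(x_t, u_t, x_next)
loss.backward()
opt.step()
```

Learning the embedding and the prediction model with one objective is the point of the sketch: an embedding trained for reconstruction alone can discard exactly the information the predictor needs for the long-term rollouts that model predictive control relies on.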
