From Pixels to Torques: Policy Learning with Deep Dynamical Models

Data-efficient learning in continuous state-action spaces from very high-dimensional observations remains a key challenge in developing fully autonomous systems. In this paper, we consider one instance of this challenge, the pixels-to-torques problem, where an agent must learn a closed-loop control policy from pixel information only. We introduce a data-efficient, model-based reinforcement learning algorithm that learns such a closed-loop policy directly from pixel information. The key ingredient is a deep dynamical model that uses deep autoencoders to learn a low-dimensional embedding of images jointly with a prediction model in this low-dimensional feature space. This joint learning ensures that the embedding captures not only the static properties of the data but also its dynamic properties. This is crucial for long-term predictions, which lie at the core of the adaptive model predictive control strategy that we use for closed-loop control. Compared to state-of-the-art reinforcement learning methods, our approach learns quickly, scales to high-dimensional state spaces, and facilitates fully autonomous learning from pixels to torques.
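
To make the joint-learning idea concrete, the sketch below trains an autoencoder embedding together with a latent-space prediction model under a single shared objective. This is a minimal illustration of the concept only: the layer sizes, module names, flattened-image inputs, and the unweighted sum of the reconstruction and prediction losses are assumptions for the example, not the architecture or training details used in the paper.

```python
# Minimal sketch of jointly learning an image embedding and a latent
# prediction model, in the spirit of the deep dynamical model described above.
# All dimensions and the loss weighting are illustrative assumptions.
import torch
import torch.nn as nn


class DeepDynamicalModelSketch(nn.Module):
    def __init__(self, pixel_dim: int, latent_dim: int, control_dim: int):
        super().__init__()
        # Encoder: maps a flattened image x_t to a low-dimensional feature z_t.
        self.encoder = nn.Sequential(
            nn.Linear(pixel_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: reconstructs the image from z_t (autoencoder part).
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, pixel_dim),
        )
        # Prediction model: predicts z_{t+1} from (z_t, u_t) in feature space.
        self.dynamics = nn.Sequential(
            nn.Linear(latent_dim + control_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, x_t, u_t, x_next):
        z_t = self.encoder(x_t)
        z_next = self.encoder(x_next)
        # Static term: reconstruct the current image from its embedding.
        recon_loss = nn.functional.mse_loss(self.decoder(z_t), x_t)
        # Dynamic term: predict the next embedding from the current one and the control.
        pred_loss = nn.functional.mse_loss(
            self.dynamics(torch.cat([z_t, u_t], dim=-1)), z_next
        )
        # Joint objective: both properties are learned together.
        return recon_loss + pred_loss


# Toy usage with random "images" and torques, just to show one training step.
model = DeepDynamicalModelSketch(pixel_dim=32 * 32, latent_dim=4, control_dim=1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x_t, u_t, x_next = torch.rand(64, 1024), torch.rand(64, 1), torch.rand(64, 1024)
loss = model(x_t, u_t, x_next)
loss.backward()
opt.step()
```

Learning the embedding and the prediction model with one objective is the point of the sketch: an embedding trained for reconstruction alone can discard exactly the information the predictor needs for the long-term rollouts that model predictive control relies on.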
