Robust Robotic Control from Pixels using Contrastive Recurrent State-Space Models

Modeling the world can benefit robot learning by providing a rich training signal for shaping an agent's latent state space. However, learning world models in unconstrained environments over high-dimensional observation spaces such as images is challenging. One source of difficulty is the presence of irrelevant but hard-to-model background distractions, and unimportant visual details of task-relevant entities. We address this issue by learning a recurrent latent dynamics model which contrastively predicts the next observation. This simple model leads to surprisingly robust robotic control even with simultaneous camera, background, and color distractions. We outperform alternatives such as bisimulation methods which impose state-similarity measures derived from divergence in future reward or future optimal actions. We obtain state-of-the-art results on the Distracting Control Suite, a challenging benchmark for pixel-based robotic control.
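The core idea of contrastive next-observation prediction can be sketched as an InfoNCE-style objective: the latent predicted by the recurrent dynamics model should score high against the encoding of its own next observation, with the other encodings in the batch serving as negatives. The snippet below is a minimal illustrative sketch, not the paper's implementation; the function name, temperature value, and use of cosine similarity are assumptions.

```python
import numpy as np

def info_nce_loss(z_pred, e_next, temperature=0.1):
    """InfoNCE loss for contrastive next-step prediction.

    z_pred:  (B, D) latents predicted by the recurrent dynamics model.
    e_next:  (B, D) encodings of the actual next observations.
    Row i of z_pred is a positive pair with row i of e_next; all other
    rows in the batch act as negatives.
    """
    # Cosine similarity: normalize both sets of vectors.
    z = z_pred / np.linalg.norm(z_pred, axis=1, keepdims=True)
    e = e_next / np.linalg.norm(e_next, axis=1, keepdims=True)
    logits = z @ e.T / temperature                # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal: prediction i vs. next observation i.
    return -np.mean(np.diag(log_probs))
```

Because the loss only asks the latent to discriminate the true next observation from alternatives, rather than to reconstruct it pixel by pixel, the model is not forced to spend capacity on hard-to-model background distractors.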
