论文信息 - Robust Policies via Mid-Level Visual Representations: An Experimental Study in Manipulation and Navigation

Robust Policies via Mid-Level Visual Representations: An Experimental Study in Manipulation and Navigation

Vision-based robotics often separates the control loop into one module for perception and a separate module for control. It is possible to train the whole system end-to-end (e.g. with deep RL), but doing it "from scratch" comes with a high sample complexity cost and the final result is often brittle, failing unexpectedly if the test environment differs from that of training. We study the effects of using mid-level visual representations (features learned asynchronously for traditional computer vision objectives), as a generic and easy-to-decode perceptual state in an end-to-end RL framework. Mid-level representations encode invariances about the world, and we show that they aid generalization, improve sample complexity, and lead to a higher final performance. Compared to other approaches for incorporating invariances, such as domain randomization, asynchronously trained mid-level representations scale better: both to harder problems and to larger domain shifts. In practice, this means that mid-level representations could be used to successfully train policies for tasks where domain randomization and learning-from-scratch failed. We report results on both manipulation and navigation tasks, and for navigation include zero-shot sim-to-real experiments on real robots.

[1] Vladlen Koltun,et al. Learning to Act by Predicting the Future , 2016, ICLR.

[2] Andrea Vedaldi,et al. Learning multiple visual domains with residual adapters , 2017, NIPS.

[3] Wojciech Zaremba,et al. Domain randomization for transferring deep neural networks from simulation to the real world , 2017, 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[4] Alexei A. Efros,et al. Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[5] Jitendra Malik,et al. Habitat: A Platform for Embodied AI Research , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[6] Oriol Vinyals,et al. Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.

[7] Marcin Andrychowicz,et al. Multi-Goal Reinforcement Learning: Challenging Robotics Environments and Request for Research , 2018, ArXiv.

[8] Bernard Ghanem,et al. Driving Policy Transfer via Modularity and Abstraction , 2018, CoRL.

[9] Paolo Favaro,et al. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles , 2016, ECCV.

[10] Alec Radford,et al. Proximal Policy Optimization Algorithms , 2017, ArXiv.

[11] Sergey Levine,et al. Deep spatial autoencoders for visuomotor learning , 2015, 2016 IEEE International Conference on Robotics and Automation (ICRA).

[12] Leonidas Guibas,et al. Side-Tuning: A Baseline for Network Adaptation via Additive Side Networks , 2019, ECCV.

[13] Ali Farhadi,et al. Visual Semantic Navigation using Scene Priors , 2018, ICLR.

[14] Rob Fergus,et al. Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[15] Supun Samarasekera,et al. Ten-fold Improvement in Visual Odometry Using Landmark Matching , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[16] Jitendra Malik,et al. On Evaluation of Embodied Navigation Agents , 2018, ArXiv.

[17] Sergey Levine,et al. Sim-To-Real via Sim-To-Sim: Data-Efficient Robotic Grasping via Randomized-To-Canonical Adaptation Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18] Michal Valko,et al. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning , 2020, NeurIPS.

[19] Pascal Vincent,et al. Representation Learning: A Review and New Perspectives , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20] Alexei A. Efros,et al. Self-Supervised Policy Adaptation during Deployment , 2020, ICLR.

[21] Andrew J. Davison,et al. RLBench: The Robot Learning Benchmark & Learning Environment , 2019, IEEE Robotics and Automation Letters.

[22] Surya P. N. Singh,et al. V-REP: A versatile and scalable robot simulation framework , 2013, 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[23] Ross B. Girshick,et al. Fast R-CNN , 2015, 1504.08083.

[24] Pieter Abbeel,et al. Learning to Manipulate Deformable Objects without Demonstrations , 2019, Robotics: Science and Systems.

[25] Jitendra Malik,et al. Learning to Poke by Poking: Experiential Learning of Intuitive Physics , 2016, NIPS.

[26] Jitendra Malik,et al. Gibson Env: Real-World Perception for Embodied Agents , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[27] Marcin Andrychowicz,et al. Asymmetric Actor Critic for Image-Based Robot Learning , 2017, Robotics: Science and Systems.

[28] Yoshua Bengio,et al. Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[29] Ali Razavi,et al. Data-Efficient Image Recognition with Contrastive Predictive Coding , 2019, ICML.

[30] Pieter Abbeel,et al. Learning Predictive Representations for Deformable Objects Using Contrastive Estimation , 2020, CoRL.

[31] Derek Hoiem,et al. Indoor Segmentation and Support Inference from RGBD Images , 2012, ECCV.

[32] Abhinav Gupta,et al. Robot Learning in Homes: Improving Generalization and Reducing Dataset Bias , 2018, NeurIPS.

[33] Alexei A. Efros,et al. Test-Time Training for Out-of-Distribution Generalization , 2019, ArXiv.

[34] Ying Liu,et al. Deep Reinforcement Learning for Dynamic Treatment Regimes on Medical Registry Data , 2017, 2017 IEEE International Conference on Healthcare Informatics (ICHI).

[35] Russ Tedrake,et al. Dense Object Nets: Learning Dense Visual Object Descriptors By and For Robotic Manipulation , 2018, CoRL.

[36] Marcin Andrychowicz,et al. Hindsight Experience Replay , 2017, NIPS.

[37] Alexei A. Efros,et al. Test-Time Training with Self-Supervision for Generalization under Distribution Shifts , 2019, ICML.

[38] Craig Boutilier,et al. Data center cooling using model-predictive control , 2018, NeurIPS.

[39] Sergey Levine,et al. End-to-End Training of Deep Visuomotor Policies , 2015, J. Mach. Learn. Res..

[40] Marcin Andrychowicz,et al. Solving Rubik's Cube with a Robot Hand , 2019, ArXiv.

[41] Alexei A. Efros,et al. Curiosity-Driven Exploration by Self-Supervised Prediction , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[42] Herke van Hoof,et al. Addressing Function Approximation Error in Actor-Critic Methods , 2018, ICML.

[43] Abhinav Gupta,et al. The Curious Robot: Learning Visual Representations via Physical Interactions , 2016, ECCV.

[44] Andy Zeng,et al. Learning to See before Learning to Act: Visual Pre-training for Manipulation , 2020, 2020 IEEE International Conference on Robotics and Automation (ICRA).

[45] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[46] Andrew J. Davison,et al. PyRep: Bringing V-REP to Deep Robot Learning , 2019, ArXiv.

[47] Christopher Burgess,et al. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework , 2016, ICLR 2016.

[48] P. Cochat,et al. Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[49] Vladlen Koltun,et al. Does computer vision matter for action? , 2019, Science Robotics.

[50] Geoffrey E. Hinton,et al. Reducing the Dimensionality of Data with Neural Networks , 2006, Science.

[51] Demis Hassabis,et al. Mastering the game of Go with deep neural networks and tree search , 2016, Nature.

[52] Jana Kosecka,et al. Visual Representations for Semantic Target Driven Navigation , 2018, 2019 International Conference on Robotics and Automation (ICRA).

[53] Silvio Savarese,et al. Learning to Navigate Using Mid-Level Visual Priors , 2019, CoRL.

[54] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55] Geoffrey E. Hinton,et al. A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.

[56] Jakub W. Pachocki,et al. Learning dexterous in-hand manipulation , 2018, Int. J. Robotics Res..

[57] W. Marsden. I and J , 2012 .

[58] David Filliat,et al. S-RL Toolbox: Environments, Datasets and Evaluation Metrics for State Representation Learning , 2018, ArXiv.

[59] Sergey Levine,et al. (CAD)$^2$RL: Real Single-Image Flight without a Single Real Image , 2016, Robotics: Science and Systems.

[60] Leonidas J. Guibas,et al. Taskonomy: Disentangling Task Transfer Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[61] Max Welling,et al. Auto-Encoding Variational Bayes , 2013, ICLR.

[62] Alex Graves,et al. Playing Atari with Deep Reinforcement Learning , 2013, ArXiv.

[63] Yu Zhong,et al. Intrinsic shape signatures: A shape descriptor for 3D object recognition , 2009, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops.

[64] Kaiming He,et al. Momentum Contrast for Unsupervised Visual Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[65] Yaser Sheikh,et al. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[66] Michael I. Jordan,et al. Forward Models: Supervised Learning with a Distal Teacher , 1992, Cogn. Sci..