Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation

In reinforcement learning for visual navigation, it is common to develop a model for each new task and train that model from scratch with task-specific interactions in 3D environments. However, this process is expensive: massive amounts of interactions are needed for the model to generalize well. Moreover, the process is repeated whenever the task type or the goal modality changes. We present a unified approach to visual navigation using a novel modular transfer learning model. Our model can effectively leverage its experience from one source task and apply it to multiple target tasks (e.g., ObjectNav, RoomNav, ViewNav) with various goal modalities (e.g., image, sketch, audio, label). Furthermore, our model enables zero-shot experience learning, whereby it can solve the target tasks without receiving any task-specific interactive training. Our experiments on multiple photorealistic datasets and challenging tasks show that our approach learns faster, generalizes better, and outperforms SoTA models by a significant margin. Project page: https://vision.cs.utexas.edu/projects/zsel/
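To make the plug-and-play idea concrete, the sketch below illustrates the modular pattern the abstract describes: a navigation policy trained once on a source task consumes goals from a shared embedding space, so a new goal modality (image, sketch, audio, label) only requires a small modality-specific encoder into that space while the policy is reused as-is. This is an illustrative assumption-laden sketch, not the paper's actual architecture; all class names, dimensions, and weights here are hypothetical.

```python
import numpy as np

# Hypothetical sketch of modular transfer for navigation. Random frozen
# matrices stand in for trained layers; dimensions are illustrative.
rng = np.random.default_rng(0)
EMBED_DIM = 16  # assumed shared goal-embedding size


def linear(in_dim, out_dim):
    """Random frozen weight matrix standing in for a trained layer."""
    return rng.standard_normal((in_dim, out_dim)) * 0.1


class GoalEncoder:
    """Maps one goal modality into the shared embedding space."""

    def __init__(self, in_dim):
        self.w = linear(in_dim, EMBED_DIM)

    def __call__(self, goal):
        return goal @ self.w


class NavPolicy:
    """Source-task policy: observation + goal embedding -> action scores."""

    def __init__(self, obs_dim, n_actions):
        self.w = linear(obs_dim + EMBED_DIM, n_actions)

    def __call__(self, obs, goal_emb):
        return np.concatenate([obs, goal_emb], axis=-1) @ self.w


# One frozen policy; per-modality encoders "plug in" without retraining it.
policy = NavPolicy(obs_dim=32, n_actions=4)
encoders = {"image": GoalEncoder(64), "audio": GoalEncoder(48), "label": GoalEncoder(8)}

obs = rng.standard_normal((1, 32))
scores = {
    m: policy(obs, enc(rng.standard_normal((1, enc.w.shape[0]))))
    for m, enc in encoders.items()
}
for modality, s in scores.items():
    print(modality, s.shape)  # every modality maps to the same action space
```

The key design point is that only the lightweight encoders are modality-specific; the (expensive-to-train) policy never sees raw goals, which is what allows new target tasks to be attacked without further interactive training.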
