Auxiliary Tasks Speed Up Learning PointGoal Navigation

PointGoal Navigation is an embodied task that requires agents to navigate to a specified point in an unseen environment. Wijmans et al. showed that this task is solvable but their method is computationally prohibitive, requiring 2.5 billion frames and 180 GPU-days. In this work, we develop a method to significantly increase sample and time efficiency in learning PointNav using self-supervised auxiliary tasks (e.g. predicting the action taken between two egocentric observations, predicting the distance between two observations from a trajectory,etc.).We find that naively combining multiple auxiliary tasks improves sample efficiency,but only provides marginal gains beyond a point. To overcome this, we use attention to combine representations learnt from individual auxiliary tasks. Our best agent is 5x faster to reach the performance of the previous state-of-the-art, DD-PPO, at 40M frames, and improves on DD-PPO's performance at40M frames by 0.16 SPL. Our code is publicly available at this https URL.

[1]  Paolo Favaro,et al.  Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles , 2016, ECCV.

[2]  Alexei A. Efros,et al.  Context Encoders: Feature Learning by Inpainting , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Dhruv Batra,et al.  SplitNet: Sim2Sim and Task2Task Transfer for Embodied Visual Navigation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[4]  Tom Schaul,et al.  Reinforcement Learning with Unsupervised Auxiliary Tasks , 2016, ICLR.

[5]  Jitendra Malik,et al.  Mid-Level Visual Representations Improve Generalization and Sample Efficiency for Learning Visuomotor Policies , 2018 .

[6]  Laurens van der Maaten,et al.  Self-Supervised Learning of Pretext-Invariant Representations , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Yoshua Bengio,et al.  Unsupervised State Representation Learning in Atari , 2019, NeurIPS.

[8]  Ghassan Al-Regib,et al.  Self-Monitoring Navigation Agent via Auxiliary Progress Estimation , 2019, ICLR.

[9]  Stefan Lee,et al.  Embodied Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[10]  Aaron van den Oord,et al.  Shaping Belief States with Generative Environment Models for RL , 2019, NeurIPS.

[11]  Kaiming He,et al.  Momentum Contrast for Unsupervised Visual Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Phillip Isola,et al.  Contrastive Multiview Coding , 2019, ECCV.

[13]  Razvan Pascanu,et al.  Learning to Navigate in Complex Environments , 2016, ICLR.

[14]  Sergey Levine,et al.  High-Dimensional Continuous Control Using Generalized Advantage Estimation , 2015, ICLR.

[15]  Charles A. Micchelli,et al.  Learning Multiple Tasks with Kernel Methods , 2005, J. Mach. Learn. Res..

[16]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[17]  Nitish Srivastava Unsupervised Learning of Visual Representations using Videos , 2015 .

[18]  Roberto Cipolla,et al.  Multi-task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[19]  Pieter Abbeel,et al.  CURL: Contrastive Unsupervised Representations for Reinforcement Learning , 2020, ICML.

[20]  Qi Wu,et al.  Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[21]  Alexei A. Efros,et al.  Colorful Image Colorization , 2016, ECCV.

[22]  Leonidas J. Guibas,et al.  Taskonomy: Disentangling Task Transfer Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[23]  Matthias Nießner,et al.  Matterport3D: Learning from RGB-D Data in Indoor Environments , 2017, 2017 International Conference on 3D Vision (3DV).

[24]  Jitendra Malik,et al.  Gibson Env: Real-World Perception for Embodied Agents , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[25]  Stefan Lee,et al.  Decentralized Distributed PPO: Solving PointGoal Navigation , 2019, ArXiv.

[26]  Jitendra Malik,et al.  Habitat: A Platform for Embodied AI Research , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[27]  Oriol Vinyals,et al.  Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.

[28]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[29]  Kaiming He,et al.  Improved Baselines with Momentum Contrastive Learning , 2020, ArXiv.

[30]  Shane Legg,et al.  DeepMind Lab , 2016, ArXiv.

[31]  David Held,et al.  Adaptive Auxiliary Task Weighting for Reinforcement Learning , 2019, NeurIPS.

[32]  Stefan Lee,et al.  Embodied Question Answering in Photorealistic Environments With Point Cloud Perception , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[34]  Stefan Lee,et al.  Neural Modular Control for Embodied Question Answering , 2018, CoRL.

[35]  Alexei A. Efros,et al.  Curiosity-Driven Exploration by Self-Supervised Prediction , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[36]  Alexei A. Efros,et al.  Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[37]  Vladlen Koltun,et al.  Semi-parametric Topological Memory for Navigation , 2018, ICLR.

[38]  Ali Farhadi,et al.  IQA: Visual Question Answering in Interactive Environments , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[39]  Ruslan Salakhutdinov,et al.  Learning To Explore Using Active Neural Mapping , 2020, ICLR 2020.

[40]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Geoffrey E. Hinton,et al.  A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.

[42]  Yoshua Bengio,et al.  Learning deep representations by mutual information estimation and maximization , 2018, ICLR.

[43]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[44]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[45]  Jitendra Malik,et al.  On Evaluation of Embodied Navigation Agents , 2018, ArXiv.

[46]  N. Burgess The 2014 Nobel Prize in Physiology or Medicine: A Spatial Model for Cognitive Neuroscience , 2014, Neuron.

[47]  Shane Legg,et al.  Human-level control through deep reinforcement learning , 2015, Nature.

[48]  Ali Razavi,et al.  Data-Efficient Image Recognition with Contrastive Predictive Coding , 2019, ICML.

[49]  R Devon Hjelm,et al.  Learning Representations by Maximizing Mutual Information Across Views , 2019, NeurIPS.

[50]  Alec Radford,et al.  Proximal Policy Optimization Algorithms , 2017, ArXiv.

[51]  Rémi Munos,et al.  Neural Predictive Belief Representations , 2018, ArXiv.

[52]  Leonidas J. Guibas,et al.  Situational Fusion of Visual Representation for Visual Navigation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).