Supplementary Material for “Meta-Explore: Exploratory Hierarchical Vision-and-Language Navigation Using Scene Object Spectrum Grounding”

In the proposed method, we approximate the reference SOS feature of an object token using prior information about objects in the training data. For instance, for the ‘chair’ object, we collected the widths and heights of its detected bounding boxes, as shown in Figure 1. Figure 2 shows two representative values, the median and the mean, for each distribution. We choose the median, which minimizes the L1 error, to represent the reference bounding box of each object. To generate rotation-invariant SOS features, we convert the four vertices of a bounding box detected in the 640 × 480 front-view image into the vertices of a bounding box in the 2048 × 512 panoramic image using coordinate transformations. To simplify the implementation, we assume that the converted bounding box remains rectangular after its vertices are mapped into panoramic coordinates. The reference SOS feature is then computed as the logarithmic magnitude of the Fourier transform of the panoramic binary mask, mean-pooled along the vertical spectral axis. Because a shift in the spatial domain affects only the phase of the Fourier transform, the location of the reference bounding box does not matter. The two sketches below illustrate the median-based box selection and the Fourier-based feature computation.
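
As a minimal sketch of the statistic above, the snippet below computes the reference box size for a single object class from its detected box dimensions. The function name and the example values are illustrative rather than taken from any released code; only the use of the per-axis median follows the text.

```python
import numpy as np

def reference_box_size(widths, heights):
    """Reference (width, height) for one object class.

    The median of each marginal distribution minimizes the expected
    L1 error, so it is preferred over the mean, which is skewed by
    outlier detections.
    """
    return float(np.median(widths)), float(np.median(heights))

# Hypothetical 'chair' detections gathered from the training data.
chair_widths = [112, 98, 130, 105, 400]    # one spurious large box
chair_heights = [150, 160, 145, 155, 480]
print(reference_box_size(chair_widths, chair_heights))  # (112.0, 155.0)
```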

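The feature computation itself can be sketched as follows, assuming the reference box has already been converted into panoramic coordinates (the exact front-view-to-panorama transform depends on camera intrinsics not specified here). The snippet rasterizes a rectangular binary mask in a 2048 × 512 panorama, takes the log-magnitude of its 2D Fourier transform, and mean-pools along the vertical spectral axis; all identifiers are illustrative, and `log1p` is one common choice of logarithmic magnitude that stays finite at zero-magnitude bins.

```python
import numpy as np

PANO_W, PANO_H = 2048, 512  # panoramic image size used in the paper

def sos_reference_feature(box_w, box_h, u0=0, v0=PANO_H // 4):
    """Log-magnitude Fourier descriptor of a rectangular panorama mask.

    The reference box of size (box_w, box_h) is rasterized at an
    arbitrary location (u0, v0). A spatial (cyclic) shift of the mask
    changes only the phase of the Fourier transform, so the magnitude
    spectrum, and hence the feature, is independent of placement.
    """
    mask = np.zeros((PANO_H, PANO_W), dtype=np.float32)
    mask[v0:v0 + box_h, u0:u0 + box_w] = 1.0
    log_mag = np.log1p(np.abs(np.fft.fft2(mask)))
    return log_mag.mean(axis=0)  # mean pooling on the vertical spectral axis

# The same box at two horizontal positions yields the same feature:
# an agent rotation is a horizontal cyclic shift of the panorama.
f_left = sos_reference_feature(112, 155, u0=0)
f_right = sos_reference_feature(112, 155, u0=1000)
print(np.allclose(f_left, f_right, atol=1e-6))  # True
```

The final check makes the shift-invariance argument concrete: both placements produce numerically identical features, so any fixed location can be used when building the reference.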