A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning
Jing Yu Koh, Alexander Ku, Austin Waters, Jason Baldridge, Peter Anderson, Zarana Parekh, Yinfei Yang, Aishwarya Kamath, Su Wang