Voxel-informed Language Grounding

Natural language applied to natural 2D images describes a fundamentally 3D world. We present the Voxel-informed Language Grounder (VLG), a language grounding model that leverages 3D geometric information in the form of voxel maps derived from the visual input using a volumetric reconstruction model. We show that VLG significantly improves grounding accuracy on SNARE, an object reference game task. At the time of writing, VLG holds the top place on the SNARE leaderboard, achieving state-of-the-art results with a 2.0% absolute improvement.
