SAT: 2D Semantics Assisted Training for 3D Visual Grounding

3D visual grounding aims to localize, in a 3D scene usually represented as point clouds, the object region that a natural language description refers to. Compared with 2D images, point clouds are sparse, noisy, and carry limited semantic information, and these inherent limitations make 3D visual grounding more challenging. In this study, we propose 2D Semantics Assisted Training (SAT), which utilizes 2D image semantics during training to ease point-cloud-language joint representation learning and assist 3D visual grounding. The main idea is to learn auxiliary alignments between rich, clean 2D object representations and the corresponding objects or mentioned entities in 3D scenes. SAT takes 2D object semantics, i.e., the object label, image feature, and 2D geometric feature, as extra inputs during training but does not require them at inference. By effectively utilizing 2D semantics in training, our approach boosts accuracy on the Nr3D dataset from 37.7% to 49.2%, significantly surpassing a non-SAT baseline with an identical network architecture and inference inputs. Our approach outperforms the state of the art by large margins on multiple 3D visual grounding datasets, i.e., +10.4% absolute accuracy on Nr3D, +9.9% on Sr3D, and +5.6% on ScanRefer.
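To make the training-versus-inference asymmetry concrete, below is a minimal PyTorch-style sketch of the idea. All module names, feature dimensions, and the exact form of the auxiliary alignment loss are illustrative assumptions for exposition, not the authors' released implementation: 2D semantic tokens (image feature, 2D box geometry, object label) join the multimodal transformer only during training, paired with an alignment loss, and the 2D branch is simply omitted at inference.

```python
# Hypothetical sketch of SAT-style training: 2D semantics as train-only inputs
# plus an auxiliary 2D-3D alignment loss. Names and dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SATSketch(nn.Module):
    def __init__(self, d_model=256, n_obj_classes=18):
        super().__init__()
        self.point_enc = nn.Linear(1024, d_model)   # stand-in for a PointNet++ object encoder
        self.text_enc = nn.Linear(768, d_model)     # stand-in for BERT token features
        # 2D semantics token: image feature + 2D box geometry + object label (one-hot)
        self.sem2d_enc = nn.Linear(2048 + 4 + n_obj_classes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=4)
        self.score = nn.Linear(d_model, 1)          # per-object grounding score

    def forward(self, obj_feats, text_feats, sem2d_feats=None):
        objs = self.point_enc(obj_feats)            # (B, N, d) 3D object tokens
        toks = [objs, self.text_enc(text_feats)]    # append (B, L, d) language tokens
        if sem2d_feats is not None:                 # 2D branch is used in training only
            toks.append(self.sem2d_enc(sem2d_feats))
        fused = self.fusion(torch.cat(toks, dim=1))
        obj_ctx = fused[:, :objs.size(1)]           # contextualized 3D object tokens
        return self.score(obj_ctx).squeeze(-1), obj_ctx

def training_step(model, batch):
    logits, obj_ctx = model(batch["obj_feats"], batch["text_feats"],
                            batch["sem2d_feats"])
    ground_loss = F.cross_entropy(logits, batch["target_idx"])  # pick the referred object
    # Auxiliary alignment: pull each 3D object toward its paired 2D semantic token,
    # i.e., classify the diagonal of the 3D-to-2D similarity matrix.
    sem2d = model.sem2d_enc(batch["sem2d_feats"])                # (B, N, d)
    sim = obj_ctx @ sem2d.transpose(1, 2)                        # (B, N, N)
    tgt = torch.arange(sim.size(1), device=sim.device).repeat(sim.size(0))
    align_loss = F.cross_entropy(sim.flatten(0, 1), tgt)
    return ground_loss + align_loss

@torch.no_grad()
def inference(model, obj_feats, text_feats):
    logits, _ = model(obj_feats, text_feats)        # no 2D inputs required
    return logits.argmax(dim=-1)
```

In this sketch each 3D object is assumed to have exactly one paired 2D projection, so the alignment target is the diagonal of the similarity matrix; since the 2D tokens only ever enter as an optional argument, dropping them at inference requires no architectural change, which mirrors the paper's claim of identical inference input to the non-SAT baseline.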
