3D Concept Learning and Reasoning from Multi-View Images

Humans can reason accurately in 3D by gathering multi-view observations of the surrounding world. Inspired by this insight, we introduce a new large-scale benchmark for 3D multi-view visual question answering (3DMV-VQA). The dataset is collected by an embodied agent that actively moves through an environment and captures RGB images using the Habitat simulator. In total, it consists of approximately 5k scenes and 600k images, paired with 50k questions. We evaluate various state-of-the-art models for visual reasoning on our benchmark and find that they all perform poorly. We suggest that a principled approach to 3D reasoning from multi-view images is to infer a compact 3D representation of the world from the images, ground that representation in open-vocabulary semantic concepts, and then execute reasoning on the resulting 3D representation. As a first step towards this approach, we propose a novel 3D concept learning and reasoning (3D-CLR) framework that combines these components via neural fields, 2D pre-trained vision-language models, and neural reasoning operators. Experimental results suggest that our framework outperforms baseline models by a large margin, but the challenge remains largely unsolved. We further perform an in-depth analysis of the challenges and highlight potential future directions.
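The "ground, then reason" idea described above can be illustrated with a toy sketch. This is not the 3D-CLR implementation: the voxel features, the concept embeddings, the similarity threshold, and the function names `filter_concept` and `exists` are all hypothetical stand-ins. It only assumes that each 3D location carries a feature vector living in the same space as the text embeddings of concept names, so that a concept can be grounded by cosine similarity and a simple operator can then run on the resulting mask.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def filter_concept(voxel_features, concept_embedding, threshold=0.8):
    """Ground a concept: mark voxels whose feature is close to the concept embedding.

    The threshold is an arbitrary illustrative value, not one taken from the paper.
    """
    return [cosine(f, concept_embedding) >= threshold for f in voxel_features]

def exists(mask):
    """Reasoning operator: is the concept present anywhere in the scene?"""
    return any(mask)

# Hypothetical 3-dimensional features for four voxels of a scene.
VOXELS = [
    [0.9, 0.1, 0.0],   # chair-like
    [0.8, 0.2, 0.1],   # chair-like
    [0.1, 0.95, 0.0],  # table-like
    [0.0, 0.0, 0.0],   # empty space
]

# Hypothetical concept embeddings in the same feature space.
CONCEPTS = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "sofa": [0.0, 0.0, 1.0],
}

if __name__ == "__main__":
    for name, embedding in CONCEPTS.items():
        mask = filter_concept(VOXELS, embedding)
        print(name, mask, exists(mask))
```

In this sketch a question such as "Is there a sofa?" reduces to `exists(filter_concept(VOXELS, CONCEPTS["sofa"]))`. Richer questions (counting, spatial relations) would need additional operators, for example grouping grounded voxels into object instances, which this example does not attempt.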
