PTR: A Benchmark for Part-based Conceptual, Relational, and Physical Reasoning

A critical aspect of human visual perception is the ability to parse visual scenes into individual objects and further into object parts, forming part-whole hierarchies. Such composite structures could induce a rich set of semantic concepts and relations, thus playing an important role in the interpretation and organization of visual signals as well as for the generalization of visual perception and reasoning. However, existing visual reasoning benchmarks mostly focus on objects rather than parts. Visual reasoning based on the full part-whole hierarchy is much more challenging than object-centric reasoning due to finer-grained concepts, richer geometry relations, and more complex physics. Therefore, to better serve for part-based conceptual, relational and physical reasoning, we introduce a new large-scale diagnostic visual reasoning dataset named PTR. PTR contains around 70k RGBD synthetic images with ground truth object and part level annotations regarding semantic instance segmentation, color attributes, spatial and geometric relationships, and certain physical properties such as stability. These images are paired with 700k machine-generated questions covering various types of reasoning types, making them a good testbed for visual reasoning models. We examine several state-of-the-art visual reasoning models on this dataset and observe that they still make many surprising mistakes in situations where humans can easily infer the correct answer. We believe this dataset will open up new opportunities for part-based reasoning. PTR dataset and baseline models are publicly available 2.

[1]  Leonidas J. Guibas,et al.  Learning hierarchical shape segmentation and labeling from online repositories , 2017, ACM Trans. Graph..

[2]  Chen Sun,et al.  VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[3]  Ross B. Girshick,et al.  Mask R-CNN , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Li Fei-Fei,et al.  CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Siddhartha Chaudhuri,et al.  A probabilistic model for component-based shape synthesis , 2012, ACM Trans. Graph..

[6]  Aaron Hertzmann,et al.  Learning 3D mesh segmentation and labeling , 2010, ACM Trans. Graph..

[7]  Christopher D. Manning,et al.  GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Thomas A. Funkhouser,et al.  Consistent segmentation of 3D models , 2009, Comput. Graph..

[9]  Oren Etzioni,et al.  Diagram Understanding in Geometry Questions , 2014, AAAI.

[10]  Christopher D. Manning,et al.  Compositional Attention Networks for Machine Reasoning , 2018, ICLR.

[11]  Levent Burak Kara,et al.  Semantic shape editing using deformation handles , 2015, ACM Trans. Graph..

[12]  Trevor Darrell,et al.  Language-Conditioned Graph Networks for Relational Reasoning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[13]  Kaichun Mo,et al.  Learning to Group: A Bottom-Up Framework for 3D Part Discovery in Unseen Categories , 2020, ICLR.

[14]  Angel X. Chang,et al.  Linking WordNet to 3D Shapes , 2018, GWC.

[15]  Ariel Shamir,et al.  Learning how objects function via co-analysis of interactions , 2016, ACM Trans. Graph..

[16]  Subhransu Maji,et al.  3D Shape Segmentation with Projective Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  N. Mitra,et al.  Meta-representation of shape families , 2014, ACM Trans. Graph..

[18]  Leonidas J. Guibas,et al.  PartNet: A Large-Scale Benchmark for Fine-Grained and Hierarchical Part-Level 3D Object Understanding , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Qing Li,et al.  Closed Loop Neural-Symbolic Learning via Integrating Neural Perception, Grammar Parsing, and Symbolic Reasoning , 2020, ICML.

[20]  Thomas Kipf,et al.  Object-Centric Learning with Slot Attention , 2020, NeurIPS.

[21]  Josh H. McDermott,et al.  ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation , 2020, NeurIPS Datasets and Benchmarks.

[22]  Song-Chun Zhu,et al.  SMART: A Situation Model for Algebra Story Problems via Attributed Grammar , 2020, AAAI.

[23]  Geoffrey E. Hinton,et al.  How to Represent Part-Whole Hierarchies in a Neural Network , 2021, Neural Computation.

[24]  Lin Gao SDM-NET : Deep Generative Network for Structured Deformable Mesh , 2019 .

[25]  Dani Lischinski,et al.  SAGNet , 2018, ACM Trans. Graph..

[26]  Ali Farhadi,et al.  From Recognition to Cognition: Visual Commonsense Reasoning , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Leonidas J. Guibas,et al.  A scalable active framework for region annotation in 3D shape collections , 2016, ACM Trans. Graph..

[28]  Kaichun Mo,et al.  Compositionally Generalizable 3D Structure Prediction , 2020, ArXiv.

[29]  Stephen DiVerdi,et al.  Learning part-based templates from large collections of 3D shapes , 2013, ACM Trans. Graph..

[30]  Yann LeCun,et al.  MDETR - Modulated Detection for End-to-End Multi-Modal Understanding , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[31]  Leonidas J. Guibas,et al.  GRASS: Generative Recursive Autoencoders for Shape Structures , 2017, ACM Trans. Graph..

[32]  Chuang Gan,et al.  CLEVRER: CoLlision Events for Video REpresentation and Reasoning , 2020, ICLR.

[33]  S. Gershman,et al.  Language-Mediated, Object-Centric Representation Learning , 2020, FINDINGS.

[34]  Leonidas J. Guibas,et al.  Shapeglot: Learning Language for Shape Differentiation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[35]  Ligang Liu,et al.  3D Shape Segmentation and Labeling via Extreme Learning Machine , 2014, Comput. Graph. Forum.

[36]  Leonidas J. Guibas,et al.  ComplementMe , 2017, ACM Trans. Graph..

[37]  Vladlen Koltun,et al.  Joint shape segmentation with linear programming , 2011, ACM Trans. Graph..

[38]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[39]  Leonidas J. Guibas,et al.  ShapeNet: An Information-Rich 3D Model Repository , 2015, ArXiv.

[40]  Leonidas J. Guibas,et al.  StructureNet , 2019, ACM Trans. Graph..

[41]  Chuang Gan,et al.  Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding , 2018, NeurIPS.

[42]  Yash Goyal,et al.  Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Heng Tao Shen,et al.  The Gap of Semantic Parsing: A Survey on Automatic Math Word Problem Solvers , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[44]  Joshua B. Tenenbaum,et al.  Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning , 2021, ICLR.

[45]  Lei Zhang,et al.  Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[46]  Leonidas J. Guibas,et al.  StructEdit: Learning Structural Shape Variations , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Michael S. Bernstein,et al.  Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[48]  Fei Deng,et al.  Generative Scene Graph Networks , 2021, ICLR.

[49]  Oliver Brock,et al.  Manipulating articulated objects with interactive perception , 2008, 2008 IEEE International Conference on Robotics and Automation.

[50]  O. Reiser,et al.  Principles Of Gestalt Psychology , 1936 .

[51]  Yoav Artzi,et al.  A Corpus for Reasoning about Natural Language Grounded in Photographs , 2018, ACL.

[52]  Xiaogang Wang,et al.  Shape2Motion: Joint Analysis of Motion Parts and Attributes From 3D Shapes , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Feng Gao,et al.  RAVEN: A Dataset for Relational and Analogical Visual REasoNing , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Leonidas J. Guibas,et al.  Probabilistic reasoning for assembly-based 3D modeling , 2011, ACM Trans. Graph..

[55]  Ariel Shamir,et al.  Learning to predict part mobility from a single static snapshot , 2017, ACM Trans. Graph..

[56]  Song-Chun Zhu,et al.  VLGrammar: Grounded Grammar Induction of Vision and Language , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[57]  Michael S. Bernstein,et al.  Visual7W: Grounded Question Answering in Images , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Leonidas J. Guibas,et al.  SAPIEN: A SimulAted Part-Based Interactive ENvironment , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[59]  Szymon Rusinkiewicz,et al.  Modeling by example , 2004, ACM Trans. Graph..

[60]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[61]  Oliver Brock,et al.  The RBO dataset of articulated objects and interactions , 2018, Int. J. Robotics Res..

[62]  Chuang Gan,et al.  The Neuro-Symbolic Concept Learner: Interpreting Scenes Words and Sentences from Natural Supervision , 2019, ICLR.

[63]  Song-Chun Zhu,et al.  Learning by Fixing: Solving Math Word Problems with Weak Supervision , 2020, AAAI.

[64]  Song-Chun Zhu,et al.  Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning , 2021, ACL.