VLGrammar: Grounded Grammar Induction of Vision and Language

Cognitive grammar suggests that the acquisition of language grammar is grounded in visual structures. While grammar is an essential representation of natural language, it is also ubiquitous in vision, where it captures the hierarchical part-whole structure of objects. In this work, we study grounded grammar induction of vision and language in a joint learning framework. Specifically, we present VLGrammar, a method that uses compound probabilistic context-free grammars (compound PCFGs) to induce the language grammar and the image grammar simultaneously. We propose a novel contrastive learning framework to guide the joint learning of both modules. To provide a benchmark for the grounded grammar induction task, we collect a large-scale dataset, PartIt, which contains human-written sentences that describe part-level semantics for 3D objects. Experiments on the PartIt dataset show that VLGrammar outperforms all baselines on both image grammar induction and language grammar induction. The learned VLGrammar naturally benefits related downstream tasks: it improves unsupervised image clustering accuracy by 30% and performs well on image retrieval and text retrieval. Notably, the induced grammar generalizes readily to unseen categories.
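
The abstract does not spell out the contrastive objective; as a minimal sketch of the idea, a symmetric InfoNCE-style loss over paired sentence and image representations could look like the following (shapes, names, and the temperature value are assumptions for illustration, not the paper's specification):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, image_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss over a batch of paired
    sentence/image representations (hypothetical shapes: [B, D])."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature  # [B, B] similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; all other pairs in the
    # batch act as negatives for both retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In VLGrammar's setting, `text_emb` and `image_emb` would come from the language and image grammar modules for the same 3D object, so matched sentence-image pairs are pulled together while mismatched pairs in the batch are pushed apart. Likewise, the "compound" part of a compound PCFG refers to conditioning rule probabilities on a per-example latent vector z, so each example induces its own PCFG instance; a minimal sketch of that parameterization, with illustrative sizes and names, might be:

```python
import torch
import torch.nn as nn

class CompoundPCFGRules(nn.Module):
    """Sketch of compound-PCFG rule scoring: binary-rule probabilities
    are conditioned on a per-example latent z (sizes are illustrative)."""
    def __init__(self, n_nt=30, n_pt=60, z_dim=64, h_dim=128):
        super().__init__()
        self.n_nt = n_nt
        n_sym = n_nt + n_pt  # rules A -> B C range over NT and PT symbols
        self.nt_emb = nn.Embedding(n_nt, h_dim)
        self.mlp = nn.Sequential(
            nn.Linear(h_dim + z_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, n_sym * n_sym))

    def forward(self, z):
        # z: [B, z_dim], e.g., sampled from a variational encoder
        B = z.size(0)
        parent = self.nt_emb.weight.unsqueeze(0).expand(B, -1, -1)  # [B, n_nt, h]
        zz = z.unsqueeze(1).expand(-1, self.n_nt, -1)               # [B, n_nt, z]
        logits = self.mlp(torch.cat([parent, zz], dim=-1))          # [B, n_nt, n_sym^2]
        return logits.log_softmax(dim=-1)  # log P(A -> B C | z)
```

These log-probabilities would then feed a standard inside algorithm for marginalizing over parse trees; the actual architecture and grammar sizes in VLGrammar may differ.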
