Embodied Language Grounding With 3D Visual Feature Representations

We propose associating language utterances with 3D visual abstractions of the scenes they describe, encoded as 3D visual feature maps. We infer these 3D scene feature maps from RGB images of the scene via view prediction: when the inferred 3D feature map is neurally projected from a camera viewpoint, it should match the corresponding RGB view. We present generative models that condition on the dependency tree of an utterance to generate a corresponding 3D visual feature map and to reason about its plausibility, and detector models that condition on both the dependency tree of an utterance and a related image to localize the object referents in the 3D feature map inferred from the image. Our model outperforms models of language and vision that associate language with 2D CNN activations or 2D images by a large margin on a variety of tasks, such as classifying the plausibility of utterances, detecting referential expressions, and supplying rewards for trajectory optimization of object-placement policies from language instructions. We perform numerous ablations and show that the improved performance of our detectors stems from better generalization across camera viewpoints and the absence of object interference in the inferred 3D feature space, and that the improved performance of our generators stems from their ability to reason spatially about objects and their configurations in 3D when mapping from language to scenes.