Deconstructing multimodality: visual properties and visual context in human semantic processing

Multimodal semantic models that extend linguistic representations with additional perceptual input have proved successful in a range of natural language processing (NLP) tasks. Recent work has used neural methods to automatically create visual representations for words. However, this work extracts visual features from complete images and does not examine how different kinds of visual information affect performance. In contrast, we construct multimodal models that differentiate between the internal visual properties of objects and their external visual context. We evaluate the models on the task of decoding brain activity associated with the meanings of nouns, demonstrating their advantage over models based on complete images.
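To make the property/context distinction concrete, the sketch below shows one plausible way to build such representations, assuming object bounding boxes are available (e.g., from Visual Genome-style annotations): the object crop yields "internal property" features, the image with the object masked out yields "external context" features, and both are concatenated with a linguistic word vector (e.g., a skip-gram embedding). The CNN choice (VGG16), the grey-box masking, and all function names are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch: "property" vs. "context" visual vectors for a word,
# assuming bounding-box annotations. All names here are hypothetical.
import numpy as np
import torch
from PIL import Image
from torchvision import models, transforms

# Pretrained CNN as a fixed feature extractor (architecture is an
# assumption; the paper's choice may differ).
cnn = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
cnn.classifier = cnn.classifier[:-1]  # drop final layer: 4096-d output
cnn.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def cnn_features(img: Image.Image) -> np.ndarray:
    """Extract a 4096-d CNN feature vector for one image region."""
    with torch.no_grad():
        return cnn(preprocess(img).unsqueeze(0)).squeeze(0).numpy()

def property_and_context(img: Image.Image, box):
    """Split an image into the object crop (visual properties) and the
    image with the object greyed out (visual context)."""
    img = img.convert("RGB")
    x0, y0, x1, y1 = box
    obj = img.crop((x0, y0, x1, y1))              # internal properties
    ctx = img.copy()
    ctx.paste((128, 128, 128), (x0, y0, x1, y1))  # mask out the object
    return cnn_features(obj), cnn_features(ctx)

def multimodal_vector(ling_vec, prop, ctx) -> np.ndarray:
    """Concatenate L2-normalised linguistic and visual components."""
    parts = [v / (np.linalg.norm(v) + 1e-8) for v in (ling_vec, prop, ctx)]
    return np.concatenate(parts)
```

The evaluation task follows the pairwise decoding paradigm of Mitchell et al. (2008): learn a linear map from semantic vectors to voxel activations, then test whether held-out pairs of nouns can be matched to their brain images. Below is a minimal sketch; the ridge regulariser and correlation-based matching are assumptions, not necessarily the paper's exact protocol.

```python
# Hypothetical leave-two-out decoding evaluation in the style of
# Mitchell et al. (2008).
import numpy as np
from itertools import combinations
from sklearn.linear_model import Ridge

def pairwise_accuracy(sem: np.ndarray, brain: np.ndarray) -> float:
    """sem: (n_words, d) semantic vectors; brain: (n_words, v) voxels.
    Returns leave-two-out pairwise matching accuracy."""
    def corr(a, b):
        return np.corrcoef(a, b)[0, 1]

    n = len(sem)
    pairs = list(combinations(range(n), 2))
    correct = 0
    for i, j in pairs:
        train = [k for k in range(n) if k not in (i, j)]
        model = Ridge(alpha=1.0).fit(sem[train], brain[train])
        pred_i, pred_j = model.predict(sem[[i, j]])
        # Correct if the true assignment of predictions to brain images
        # scores higher than the swapped assignment.
        right = corr(pred_i, brain[i]) + corr(pred_j, brain[j])
        wrong = corr(pred_i, brain[j]) + corr(pred_j, brain[i])
        correct += right > wrong
    return correct / len(pairs)
```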
