Multimodal Distributional Semantics

Distributional semantic models derive computational representations of word meaning from the patterns of co-occurrence of words in text. Such models have been a success story of computational linguistics, providing reliable estimates of semantic relatedness for the many semantic tasks that require them. However, distributional models extract meaning information exclusively from text, which is an extremely impoverished basis compared to the rich perceptual sources that ground human semantic knowledge. We address this lack of perceptual grounding by exploiting computer vision techniques that automatically identify discrete "visual words" in images, so that the distributional representation of a word can be extended to also encompass its co-occurrence with the visual words of the images it is associated with. We propose a flexible architecture to integrate text- and image-based distributional information, and we show in a set of empirical tests that our integrated model is superior to the purely text-based approach and that it provides semantic information that is partly complementary to that of the text-based model.
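The idea of extending a word's text-based vector with visual-word co-occurrence counts can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the nearest-centroid quantization against a precomputed codebook, the weighting parameter `alpha`, and the choice of weighted concatenation as the integration step. The abstract only states that the architecture is flexible, so this shows one plausible instance rather than the authors' method.

```python
import numpy as np

def visual_word_counts(descriptors, codebook):
    """Count visual words for one target word.

    descriptors: (n, d) local image descriptors pooled over the images
                 associated with the word (hypothetical input).
    codebook:    (k, d) centroids, e.g. obtained by clustering (assumed given).
    Each descriptor is assigned to its nearest centroid, i.e. its visual word.
    """
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    return np.bincount(words, minlength=len(codebook)).astype(float)

def l2_normalize(v):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def fuse(text_vec, image_vec, alpha=0.5):
    """Weighted concatenation of the two channels (an illustrative choice).

    alpha = 1 keeps only the text channel, alpha = 0 only the image channel.
    """
    return np.concatenate([alpha * l2_normalize(text_vec),
                           (1.0 - alpha) * l2_normalize(image_vec)])

def cosine(u, v):
    """Cosine similarity, used here as the relatedness estimate."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0
```

Under these assumptions, the relatedness of two words would be estimated as `cosine(fuse(t1, v1), fuse(t2, v2))`, where `t*` are text co-occurrence vectors and `v*` are the visual-word count vectors. Normalizing each channel before weighting keeps the channel with larger raw counts from dominating the fused vector.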
