Visually Grounded Meaning Representations

In this paper we address the problem of grounding distributional representations of lexical meaning. We introduce a new model that uses stacked autoencoders to learn higher-level representations from textual and visual input. The visual modality is encoded via vectors of attributes obtained automatically from images. We create a new large-scale taxonomy of 600 visual attributes representing more than 500 concepts and 700K images. We use this dataset to train attribute classifiers and integrate their predictions with text-based distributional models of word meaning. We evaluate our model on its ability to simulate word similarity judgments and concept categorization. On both tasks, our model yields a better fit to behavioral data than baselines and related models that either rely on a single modality or do not make use of attribute-based input.
