Models of Semantic Representation with Visual Attributes

We consider the problem of grounding the meaning of words in the physical world and focus on the visual modality, which we represent by visual attributes. We create a new large-scale taxonomy of visual attributes covering more than 500 concepts and their corresponding 688K images. We use this dataset to train attribute classifiers and integrate their predictions with text-based distributional models of word meaning. We show that these bimodal models fit human word association data better than amodal models and better than word representations based on handcrafted norming data.
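As a rough illustration of what integrating attribute predictions with a text-based model can look like, the sketch below concatenates an L2-normalized textual vector with an L2-normalized vector of attribute classifier scores and compares words by cosine similarity. This is only one simple fusion strategy, chosen here for illustration; the abstract does not say which integration method the paper uses. The function names (`l2_normalize`, `fuse`, `cosine`) and all numbers are hypothetical toy values, not data from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(text_vec, attr_vec):
    """Concatenate the normalized textual and visual-attribute vectors.

    Normalizing each modality first keeps one from dominating the other
    purely because of its scale. (Assumed fusion scheme, for illustration.)
    """
    return l2_normalize(text_vec) + l2_normalize(attr_vec)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical text-based co-occurrence vectors (toy values).
text = {
    "apple": [2.0, 0.0, 1.0],
    "cherry": [1.0, 0.0, 2.0],
    "car": [0.0, 3.0, 0.0],
}

# Hypothetical attribute classifier scores, e.g. for is_red, is_round, has_wheels.
attrs = {
    "apple": [0.9, 0.8, 0.1],
    "cherry": [0.8, 0.9, 0.0],
    "car": [0.1, 0.0, 0.9],
}

# One bimodal vector per word.
bimodal = {word: fuse(text[word], attrs[word]) for word in text}

sim_fruit = cosine(bimodal["apple"], bimodal["cherry"])
sim_mixed = cosine(bimodal["apple"], bimodal["car"])
```

With these toy values, `sim_fruit` comes out larger than `sim_mixed`, since the two fruits agree in both modalities. Similarity scores of this kind are what would then be compared against human word association data.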
