Learning word meanings from images of natural scenes

From early on, children face the challenge of learning the meanings of words from noisy and ambiguous contexts. The utterances that guide their learning are produced in complex scenes, which makes the mapping between visual and linguistic cues difficult. A key challenge in computational modeling of word meaning acquisition is to provide scene representations whose sources of information and statistical properties are comparable in complexity to natural data. We propose a novel computational model of cross-situational word learning that takes images of natural scenes paired with their descriptions as input and incrementally learns probabilistic associations between words and image features. Through a set of experiments we show that the model learns meaning representations that correlate with human similarity judgments, and that, given an image of a scene, it produces words conceptually related to the image.
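To make the kind of incremental, probabilistic word-feature association described above concrete, the sketch below implements a minimal cross-situational learner: each utterance-scene pair is processed once, the visual feature activations are softly aligned to the words of the utterance in proportion to the current associations, and the resulting expected counts are accumulated into per-word meaning distributions. This is a simplified sketch, not the authors' implementation; the class and method names (CrossSituationalLearner, update, describe), the softmax-free column normalisation, and the use of CNN-style feature activations are assumptions made for illustration.

```python
# Minimal, hypothetical sketch of incremental cross-situational learning of
# word-feature associations (illustrative only, not the paper's exact model).

from collections import defaultdict
import numpy as np

class CrossSituationalLearner:
    def __init__(self, n_features, smoothing=1e-4):
        self.n_features = n_features
        self.smoothing = smoothing
        # Running word-feature association scores (accumulated expected counts).
        self.assoc = defaultdict(lambda: np.full(n_features, smoothing))

    def meaning(self, word):
        """Normalised meaning representation, i.e. p(feature | word)."""
        a = self.assoc[word]
        return a / a.sum()

    def update(self, words, image_features):
        """Process one utterance-scene pair incrementally.

        words:          list of word tokens in the utterance
        image_features: 1-D array of non-negative visual feature activations
                        (e.g. CNN feature responses for the paired image)
        """
        image_features = np.asarray(image_features, dtype=float)
        # Alignment step: how well does each word currently explain each
        # feature, relative to the other words in the same utterance?
        scores = np.stack([self.meaning(w) for w in words])   # (n_words, n_features)
        align = scores / scores.sum(axis=0, keepdims=True)    # normalise over words
        # Update step: distribute each feature's activation over the words
        # according to the alignment and accumulate into the associations.
        for i, w in enumerate(words):
            self.assoc[w] += align[i] * image_features

    def describe(self, image_features, top_k=5):
        """Return the words whose learned meanings best match a new image."""
        image_features = np.asarray(image_features, dtype=float)
        sims = {w: float(np.dot(self.meaning(w), image_features))
                for w in self.assoc}
        return sorted(sims, key=sims.get, reverse=True)[:top_k]
```

In use, update would be called once per image-caption pair in corpus order, and describe queried with the features of a held-out image to retrieve conceptually related words, mirroring the two evaluations reported in the abstract.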
