Multimodal Semantic Learning from Child-Directed Input

Children learn the meanings of words by being exposed to perceptually rich situations (linguistic discourse, visual scenes, etc.). Current computational learning models typically simulate these rich situations through impoverished symbolic approximations. In this work, we present a distributed word learning model that operates on child-directed speech paired with realistic visual scenes. The model integrates linguistic and extra-linguistic information (visual and social cues), handles referential uncertainty, and correctly learns to associate words with objects, even in cases of limited linguistic exposure.
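To make the setting concrete, the sketch below illustrates cross-situational word-object learning under referential uncertainty with social-cue weighting. It is a minimal count-based illustration, not the distributed model described in the abstract; the class name, the `observe`/`meaning` methods, the `cued` parameter, the weighting scheme, and the toy object labels are all hypothetical.

```python
# Minimal sketch of cross-situational word-object association learning under
# referential uncertainty. This is a simplified count-based illustration, not
# the paper's distributed model; the cue weighting and toy data are assumptions.
from collections import defaultdict

class CrossSituationalLearner:
    def __init__(self):
        # assoc[word][object] accumulates cue-weighted co-occurrence evidence
        self.assoc = defaultdict(lambda: defaultdict(float))

    def observe(self, utterance, scene, cued=None):
        """Update associations for one situation.

        utterance: list of words in the child-directed utterance
        scene:     list of candidate object labels visible in the scene
        cued:      optional subset of `scene` highlighted by social cues
                   (e.g., caregiver gaze or pointing); hypothetical parameter
        """
        for word in utterance:
            for obj in scene:
                # Boost objects singled out by social cues; otherwise spread
                # credit uniformly over all candidates (referential uncertainty).
                weight = 2.0 if cued and obj in cued else 1.0
                self.assoc[word][obj] += weight / len(scene)

    def meaning(self, word):
        """Return the object most strongly associated with `word`, if any."""
        candidates = self.assoc.get(word)
        if not candidates:
            return None
        return max(candidates, key=candidates.get)

# Toy usage: after a few ambiguous situations, "ball" resolves to BALL.
learner = CrossSituationalLearner()
learner.observe(["look", "at", "the", "ball"], ["BALL", "DOG"], cued=["BALL"])
learner.observe(["nice", "ball"], ["BALL", "CUP"])
learner.observe(["where", "is", "the", "dog"], ["DOG", "CUP"], cued=["DOG"])
print(learner.meaning("ball"))  # -> "BALL"
```

In a distributed variant, the count table would be replaced by word embeddings trained against visual feature vectors of the co-present objects, but the handling of referential uncertainty and social cues follows the same logic.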
