A Computational Model of Early Word Learning from the Infant's Point of View

Human infants have the remarkable ability to learn the associations between object names and visual objects from inherently ambiguous experiences. Researchers in cognitive science and developmental psychology have built formal models that implement in-principle learning algorithms, and then used pre-selected and pre-cleaned datasets to test the models' ability to find statistical regularities in the input data. In contrast to previous modeling approaches, the present study used egocentric video and gaze data collected from infant learners during natural toy play with their parents. This allowed us to capture the learning environment from the learner's own point of view. We then used a Convolutional Neural Network (CNN) model to process sensory data from the infant's point of view and learn name-object associations from scratch. As the first model that takes raw egocentric video to simulate infant word learning, the present study provides a proof of principle that the problem of early word learning can be solved using the actual visual data perceived by infant learners. Moreover, we conducted simulation experiments to systematically determine how visual, perceptual, and attentional properties of infants' sensory experiences may affect word learning.
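The core learning problem the abstract describes, resolving inherently ambiguous name-object pairings across many scenes, can be illustrated with a deliberately minimal sketch. This is not the paper's CNN model; it is a toy cross-situational learner that simply tallies word-object co-occurrences, showing how the correct referent comes to dominate even though each individual scene is ambiguous. The scene data below is invented for illustration.

```python
# Toy cross-situational word learning: on each ambiguous "scene",
# several objects are in view and one word is heard. Tallying
# co-occurrences lets the true word-object pairing win out.
from collections import defaultdict

def learn_associations(scenes):
    """scenes: list of (word, visible_objects) pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for word, objects in scenes:
        for obj in objects:
            counts[word][obj] += 1
    # For each word, pick the object it co-occurred with most often.
    return {word: max(c, key=c.get) for word, c in counts.items()}

# Each scene is ambiguous on its own, but consistent across scenes.
scenes = [
    ("ball", {"ball", "cup"}),
    ("ball", {"ball", "duck"}),
    ("cup",  {"cup", "duck"}),
    ("cup",  {"cup", "ball"}),
]
print(learn_associations(scenes))  # each word maps to its true referent
```

The paper's contribution is that the "objects in view" are not clean symbols like these but raw egocentric video frames, so the visual representations themselves must be learned jointly with the associations.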
