A computational model of early language acquisition from audiovisual experiences of young infants

Earlier research has suggested that human infants might use statistical dependencies between speech and non-linguistic multimodal input to bootstrap their language learning before they know how to segment words from running speech. However, the feasibility of this hypothesis in terms of real-world infant experiences has remained unclear. This paper presents a step towards a more realistic test of the multimodal bootstrapping hypothesis by describing a neural network model that can learn word segments and their meanings from referentially ambiguous acoustic input. The model is tested on recordings of real infant-caregiver interactions, using utterance-level labels for concrete visual objects that the infant was attending to when the caregiver spoke an utterance containing the name of the object, and random visual labels for utterances spoken in the absence of attention. The results show that the beginnings of lexical knowledge may indeed emerge from individually ambiguous learning scenarios. In addition, the hidden layers of the network show gradually increasing selectivity to phonetic categories as a function of layer depth, resembling models trained for phone recognition in a supervised manner.
