A Computational Model of Word Learning from Multimodal Sensory Input

How do infants segment continuous streams of speech to discover the words of their language? Current theories emphasize the role of acoustic evidence in discovering word boundaries (Cutler 1991; Brent 1999; de Marcken 1996; Friederici & Wessels 1993; see also Bolinger & Gertsman 1957). To test an alternative hypothesis, we recorded natural infant-directed speech from caregivers engaged in play with their pre-linguistic infants, centered around common objects. We also recorded the visual context in which the speech occurred by capturing images of these objects. We analyzed the data using two computational models: one processed only the acoustic recordings, while the second integrated acoustic and visual input. Both models were implemented using standard speech and vision processing techniques, enabling them to operate directly on sensory data. We show that using visual context in conjunction with spoken input dramatically improves learning compared with using acoustic evidence alone. These results demonstrate the power of inter-modal learning and suggest that infants may use evidence from visual and other non-acoustic context to aid speech segmentation and spoken word discovery.
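To illustrate the kind of cross-modal evidence the second model exploits (this is a toy sketch, not the paper's implementation), treat a candidate word and a visual object category as binary events across play episodes. Their mutual information measures how strongly the word's occurrence predicts the object's presence; a word heard mostly when its referent is in view scores high, while a word unrelated to the scene scores near zero. The `episodes` data below is invented for illustration.

```python
from math import log2

def mutual_information(pairs):
    """Estimate mutual information (in bits) between two binary event
    streams, given a list of (word_present, object_present) pairs."""
    n = len(pairs)
    # Joint counts over the four possible outcomes.
    joint = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for a, b in pairs:
        joint[(a, b)] += 1
    mi = 0.0
    for (a, b), count in joint.items():
        if count == 0:
            continue  # zero-count cells contribute nothing
        p_ab = count / n
        p_a = sum(joint[(a, x)] for x in (0, 1)) / n  # marginal of word
        p_b = sum(joint[(x, b)] for x in (0, 1)) / n  # marginal of object
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi

# Hypothetical episodes: a word that reliably co-occurs with its referent
# versus a word that is statistically independent of the object.
aligned = mutual_information([(1, 1)] * 8 + [(0, 0)] * 8 + [(1, 0), (0, 1)])
independent = mutual_information([(1, 1), (1, 0), (0, 1), (0, 0)])
```

Here `aligned` comes out close to half a bit while `independent` is exactly zero, mirroring the intuition that only referentially grounded word hypotheses gain support from the visual channel.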

[1]  M. Goldsmith,et al.  Statistical Learning by 8-Month-Old Infants , 1996 .

[2]  B. Julesz Foundations of Cyclopean Perception , 1971 .

[3]  Anthony J. Robinson,et al.  An application of recurrent nets to phone probability estimation , 1994, IEEE Trans. Neural Networks.

[4]  G. A. Miller THE PSYCHOLOGICAL REVIEW THE MAGICAL NUMBER SEVEN, PLUS OR MINUS TWO: SOME LIMITS ON OUR CAPACITY FOR PROCESSING INFORMATION 1 , 1956 .

[5]  Stephanie Seneff,et al.  Transcription and Alignment of the TIMIT Database , 1996 .

[6]  K. Stevens,et al.  Linguistic experience alters phonetic perception in infants by 6 months of age. , 1992, Science.

[7]  Carl de Marcken,et al.  Unsupervised language acquisition , 1996, ArXiv.

[8]  C. A. Ferguson,et al.  Talking to Children: Language Input and Acquisition , 1979 .

[9]  Alex Pentland,et al.  Learning words from sights and sounds: a computational model , 2002, Cogn. Sci..

[10]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[11]  J. Huttenlocher,et al.  Early word meanings: The case of object names , 1987, Cognitive Psychology.

[12]  Deb Roy,et al.  Toco the toucan: a synthetic character guided by perception, emotion, and story , 1997, SIGGRAPH '97.

[13]  J. Werker,et al.  Developmental changes across childhood in the perception of non-native speech sounds. , 1983, Canadian journal of psychology.

[14]  Bernt Schiele,et al.  Probabilistic object recognition using multidimensional receptive field histograms , 1996, Proceedings of 13th International Conference on Pattern Recognition.

[15]  A. Cutler Segmentation problems, rhythmic solutions * , 1994 .

[16]  Deb Roy,et al.  Integration of speech and vision using mutual information , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).