Learning words from natural audio-visual input

We present a model of early word learning which learns from natural audio and visual input. The model has been successfully implemented to learn words and their audio-visual grounding from camera and microphone input. Although simple in its current form, this model is a first step towards a more complete, fully-grounded model of language acquisition. Practical applications include adaptive human-machine interfaces for information browsing, assistive technologies, education, and entertainment.